Abstract

Kernel ridge regression (KRR) is a standard method for performing non-parametric
regression over reproducing kernel Hilbert spaces. Given $n$ samples, the time and space
complexity of computing the KRR estimate scale as $O(n^3)$ and $O(n^2)$ respectively, and so
are prohibitive in many cases. We propose approximations of KRR based on $m$-dimensional
randomized sketches of the kernel matrix, and study how small the projection dimension
$m$ can be chosen while still preserving minimax optimality of the approximate KRR
estimate. For various classes of randomized sketches, including those based on Gaussian and
randomized Hadamard matrices, we prove that it suffices to choose the sketch dimension
$m$ proportional to the statistical dimension (modulo logarithmic factors). Thus, we
obtain fast and minimax optimal approximations to the KRR estimate for non-parametric
regression.


1 Introduction 


The goal of non-parametric regression is to make predictions of a response variable
$Y \in \mathbb{R}$ based on observing a covariate vector $X \in \mathcal{X}$. In practice, we
are given a collection of $n$ samples, say $\{(x_i, y_i)\}_{i=1}^n$ of covariate-response
pairs, and our goal is to estimate the regression function
$f^*(x) = \mathbb{E}[Y \mid X = x]$. In the standard Gaussian model, it is assumed that the
covariate-response pairs are related via the model

$$y_i = f^*(x_i) + \sigma w_i, \quad \text{for } i = 1, \ldots, n, \qquad (1)$$

where the sequence $\{w_i\}_{i=1}^n$ consists of i.i.d. standard Gaussian variates. It is
typical to assume that the regression function $f^*$ has some regularity properties, and one
way of enforcing such structure is to require $f^*$ to belong to a reproducing kernel Hilbert
space, or RKHS for short [3]. Given such an assumption, it is natural to estimate $f^*$ by
minimizing a combination of the least-squares fit to the data and a penalty term involving the
squared Hilbert norm, leading to an estimator known as kernel ridge regression, or KRR for
short [10]. From a statistical point of view, the behavior of KRR can be characterized using
existing results on M-estimation and empirical processes. When the regularization parameter
is set appropriately, it is known to yield a function estimate with minimax prediction error
for various classes of kernels.

Despite these attractive statistical properties, the computational complexity of computing
the KRR estimate prevents it from being routinely used in large-scale problems. More
precisely, in a standard implementation, the time complexity and space complexity of KRR
scale as $O(n^3)$ and $O(n^2)$, respectively, where $n$ refers to the
number of samples. As a consequence, it becomes important to design methods to compute
approximate forms of the KRR estimate, while retaining guarantees of optimality in terms
of statistical minimaxity. Various authors have taken different approaches to this problem.
Zhang et al. analyze a distributed implementation of KRR, in which a set of $t$ machines
each compute a separate estimate based on a random $t$-way partition of the full data set,
and combine them into a global estimate by averaging. Since each machine solves a KRR problem
on roughly $n/t$ samples, this divide-and-conquer approach has time complexity and space
complexity $O(n^3/t^3)$ and $O(n^2/t^2)$, respectively. Zhang et al. [30]
give conditions on the number of splits $t$, as a function of the kernel, under which minimax
optimality of the resulting estimator can be guaranteed. More closely related to this paper
are methods that are based on forming a low-rank approximation to the $n$-dimensional kernel
matrix, such as the Nystrom methods (e.g., [3]). The time complexity of using a low-rank
approximation is either $O(nr^2)$ or $O(n^2 r)$, depending on the specific approach (excluding
the time for factorization), where $r$ is the maintained rank, and the space complexity is
$O(nr)$. Some recent work [1] analyzes the tradeoff between the rank $r$ and the resulting
statistical performance of the estimator, and we discuss this line of work at more length in
Section 3.3.

In this paper, we consider approximations to KRR based on random projections, also
known as sketches, of the data. Random projections are a classical way of performing
dimensionality reduction, and are widely used in many algorithmic contexts (e.g., see the
book [27] and references therein). Our proposal is to approximate the $n$-dimensional kernel
matrix by projecting its row and column subspaces to a randomly chosen $m$-dimensional
subspace with $m \ll n$. By doing so, an approximate form of the KRR estimate can be obtained
by solving an $m$-dimensional quadratic program, which involves time and space complexity
$O(m^3)$ and $O(m^2)$. Computing the approximate kernel matrix is a pre-processing step that
has time complexity $O(n^2 \log(m))$ for suitably chosen projections; this pre-processing step
is trivially parallelizable, meaning it can be reduced to $O(n^2 \log(m)/t)$ by using
$t \le n$ clusters.

Given such an approximation, we pose the following question: how small can the projection
dimension $m$ be chosen while still retaining minimax optimality of the approximate
KRR estimate? We answer this question by connecting it to the statistical dimension $d_n$ of
the $n$-dimensional kernel matrix, a quantity that measures the effective number of degrees of
freedom. (See Section 2.3 for a precise definition.) From the results of earlier work on
random projections for constrained least-squares estimators (e.g., see [18, 19]), it is natural
to conjecture that it should be possible to project the kernel matrix down to the statistical
dimension while preserving minimax optimality of the resulting estimator. The main
contribution of this paper is to confirm this conjecture for several classes of random
projection matrices.

The remainder of this paper is organized as follows. Section 2 is devoted to further
background on non-parametric regression, reproducing kernel Hilbert spaces and associated
measures of complexity, as well as the notion of statistical dimension of a kernel. In
Section 3, we turn to statements of our main results. Theorem 2 provides a general sufficient
condition on a random sketch for the associated approximate form of KRR to achieve the minimax
risk. In Corollary 1, we derive some consequences of this general result for particular
classes of random sketch matrices, and confirm these theoretical predictions with some
simulations. We also compare at more length to methods based on the Nystrom approximation in
Section 3.3. Section 4 is devoted to the proofs of our main results, with the proofs of more
technical results deferred to the appendices. We conclude with a discussion in Section 5.


2 Problem formulation and background

We begin by introducing some background on nonparametric regression and reproducing 
kernel Hilbert spaces, before formulating the problem discussed in this paper. 

2.1 Regression in reproducing kernel Hilbert spaces 

Given $n$ samples $\{(x_i, y_i)\}_{i=1}^n$ from the non-parametric regression model (1), our
goal is to estimate the unknown regression function $f^*$. The quality of an estimate
$\hat{f}$ can be measured in different ways: in this paper, we focus on the squared
$L^2(P_n)$ error

$$\|\hat{f} - f^*\|_n^2 := \frac{1}{n} \sum_{i=1}^n \big(\hat{f}(x_i) - f^*(x_i)\big)^2. \qquad (2)$$

Naturally, the difficulty of non-parametric regression is controlled by the structure in the
function $f^*$, and one way of modeling such structure is within the framework of a
reproducing kernel Hilbert space (or RKHS for short). Here we provide a very brief
introduction, referring the reader to standard books on the subject for more details and
background.

Given a space $\mathcal{X}$ endowed with a probability distribution $P$, the space $L^2(P)$
consists of all functions that are square-integrable with respect to $P$. In abstract terms, a
space $\mathcal{H} \subset L^2(P)$ is an RKHS if for each $x \in \mathcal{X}$, the evaluation
function $f \mapsto f(x)$ is a bounded linear functional. In more concrete terms, any RKHS is
generated by a positive semidefinite (PSD) kernel function in the following way. A PSD kernel
function is a symmetric function $\mathcal{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
such that, for any positive integer $N$, collection of points $\{v_1, \ldots, v_N\}$ and
weight vector $\omega \in \mathbb{R}^N$, the sum
$\sum_{i,j=1}^N \omega_i \omega_j \mathcal{K}(v_i, v_j)$ is non-negative. Suppose moreover that
for each fixed $v \in \mathcal{X}$, the function $u \mapsto \mathcal{K}(u, v)$ belongs to
$L^2(P)$. We can then consider the vector space of all functions
$g : \mathcal{X} \to \mathbb{R}$ of the form

$$g(\cdot) = \sum_{i=1}^N \omega_i \mathcal{K}(\cdot, v_i)$$

for some integer $N$, points $\{v_1, \ldots, v_N\} \subset \mathcal{X}$ and weight vector
$\omega \in \mathbb{R}^N$. By taking the closure of all such linear combinations, it can be
shown that we generate an RKHS, and one that is uniquely associated with the kernel
$\mathcal{K}$. We provide some examples of various kernels and the associated function classes
in Section 2.3 to follow.
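
The PSD property defined above can be checked numerically on a finite set of points. The
following is a minimal sketch in Python with NumPy (a tool choice of this illustration, not of
the original development); the helper name `gram_matrix`, the kernel $\min\{u, v\}$ and the
random points are assumptions made only for the example.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Return the N x N matrix with entries kernel(v_i, v_j)."""
    return np.array([[kernel(u, v) for v in points] for u in points])

# Example kernel (assumption for illustration): K(u, v) = min{u, v} on [0, 1].
min_kernel = lambda u, v: min(u, v)

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=8)
G = gram_matrix(min_kernel, points)

# Check 1: all eigenvalues of the symmetric Gram matrix are >= 0 (up to round-off).
min_eig = np.linalg.eigvalsh(G).min()

# Check 2: the quadratic form sum_{i,j} w_i w_j K(v_i, v_j) is >= 0 for random weights.
weights = rng.standard_normal((100, 8))
quad_forms = np.einsum("ki,ij,kj->k", weights, G, weights)
```

A passing check on one point set does not prove that a kernel is PSD; it only fails to
contradict it.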

2.2 Kernel ridge regression and its sketched form 

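Given the dataset $\{(x_i, y_i)\}_{i=1}^n$, a natural method for estimating the unknown
function $f^* \in \mathcal{H}$ is known as kernel ridge regression (KRR): it is based on the
convex program

$$f^\diamond := \arg\min_{f \in \mathcal{H}} \Big\{ \frac{1}{2n} \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda_n \|f\|_{\mathcal{H}}^2 \Big\}, \qquad (3)$$

where $\lambda_n > 0$ is a regularization parameter.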

As stated, this optimization problem can be infinite-dimensional in nature, since it takes
place over the Hilbert space. However, as a straightforward consequence of the representer
theorem [13], the solution to this optimization problem can be obtained by solving an
$n$-dimensional convex program. In particular, let us define the empirical kernel matrix,
namely the $n$-dimensional symmetric matrix $K$ with entries
$K_{ij} = n^{-1} \mathcal{K}(x_i, x_j)$. Here we adopt the $n^{-1}$ scaling for later
theoretical convenience. In terms of this matrix, the KRR estimate can be obtained by first
solving the quadratic program

$$\omega^\diamond = \arg\min_{\omega \in \mathbb{R}^n} \Big\{ \frac{1}{2} \omega^T K^2 \omega - \omega^T \frac{K y}{\sqrt{n}} + \lambda_n \omega^T K \omega \Big\}, \qquad (4a)$$

and then outputting the function

$$f^\diamond(\cdot) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \omega_i^\diamond \mathcal{K}(\cdot, x_i). \qquad (4b)$$
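
To make the quadratic program (4a) concrete, the following is a minimal sketch in Python with
NumPy. It is not the authors' implementation: the function names `krr_fit` and
`krr_fitted_values`, the toy data and the value of the regularization parameter are assumptions
for illustration. It uses the fact that setting the gradient of (4a) to zero gives
$K (K + 2\lambda_n I) \omega = K y / \sqrt{n}$, which is solved by
$\omega = (K + 2\lambda_n I)^{-1} y / \sqrt{n}$.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Return a minimizer of (4a); K is assumed to carry the 1/n scaling already."""
    n = K.shape[0]
    # Stationarity: K (K + 2*lam*I) omega = K y / sqrt(n).
    return np.linalg.solve(K + 2.0 * lam * np.eye(n), y / np.sqrt(n))

def krr_fitted_values(K, omega):
    """Evaluate (4b) at the design points: f(x_j) = sqrt(n) * (K omega)_j."""
    return np.sqrt(K.shape[0]) * (K @ omega)

# Toy data (hypothetical): kernel min{u, v} on an equispaced design, f*(x) = x.
n, lam = 50, 1e-3
x = np.arange(1, n + 1) / n
K = np.minimum.outer(x, x) / n              # K_ij = K(x_i, x_j) / n
rng = np.random.default_rng(0)
y = x + 0.1 * rng.standard_normal(n)
omega = krr_fit(K, y, lam)
fitted = krr_fitted_values(K, omega)
```

The dense solve costs on the order of $n^3$ operations and the matrix $K$ occupies on the
order of $n^2$ numbers.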

In principle, the original KRR optimization problem (4a) is simple to solve: it is an
$n$-dimensional quadratic program, and can be solved exactly using $O(n^3)$ operations via a QR
decomposition. However, in many applications, the number of samples may be large, so that this
type of cubic scaling is prohibitive. In addition, the $n$-dimensional kernel matrix $K$ is
dense in general, and so requires storage of order $n^2$ numbers, which can also be problematic
in practice.

In this paper, we consider an approximation based on limiting the original parameter
$\omega \in \mathbb{R}^n$ to an $m$-dimensional subspace of $\mathbb{R}^n$, where $m \ll n$ is
the projection dimension. We define this approximation via a sketch matrix
$S \in \mathbb{R}^{m \times n}$ such that the $m$-dimensional subspace is generated by the row
span of $S$. More precisely, the sketched kernel ridge regression estimate is given by first
solving

$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{2} \alpha^T (SK)(KS^T) \alpha - \alpha^T S \frac{K y}{\sqrt{n}} + \lambda_n \alpha^T S K S^T \alpha \Big\}, \qquad (5a)$$

and then outputting the function

$$\hat{f}(\cdot) := \frac{1}{\sqrt{n}} \sum_{i=1}^n (S^T \hat{\alpha})_i \mathcal{K}(\cdot, x_i). \qquad (5b)$$

Note that the sketched program (5a) is a quadratic program in $m$ dimensions: it takes as
input the $m$-dimensional matrices $(S K^2 S^T, S K S^T)$ and the $m$-dimensional vector
$S K y$. Consequently, it can be solved efficiently via QR decomposition with computational
complexity $O(m^3)$. Moreover, the computation of the sketched kernel matrix
$SK = [SK_1, \ldots, SK_n]$ in the input can be parallelized across its columns.
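
The sketched program (5a) and the fitted function (5b) can be illustrated in the same way.
The following Python/NumPy code is a sketch under stated assumptions, not the authors'
implementation: the function names, the toy data, and the $1/\sqrt{m}$ scaling of the Gaussian
sketch entries are all choices made for this example.

```python
import numpy as np

def sketched_krr_fit(K, y, lam, S):
    """Solve the m-dimensional program (5a) for alpha."""
    n = K.shape[0]
    SK = S @ K                                   # m x n sketched kernel matrix
    A = SK @ SK.T + 2.0 * lam * (SK @ S.T)       # S K^2 S^T + 2*lam * S K S^T
    b = SK @ y / np.sqrt(n)                      # S K y / sqrt(n)
    return np.linalg.solve(A, b)                 # m x m linear system

def sketched_fitted_values(K, S, alpha):
    """Evaluate (5b) at the design points: sqrt(n) * K S^T alpha."""
    return np.sqrt(K.shape[0]) * (K @ (S.T @ alpha))

# Toy data (hypothetical): kernel min{u, v} on an equispaced design, f*(x) = x.
n, m, lam = 200, 10, 1e-2
x = np.arange(1, n + 1) / n
K = np.minimum.outer(x, x) / n
rng = np.random.default_rng(1)
y = x + 0.5 * rng.standard_normal(n)

# Gaussian sketch; the 1/sqrt(m) entry scaling is an assumption of this example.
S = rng.standard_normal((m, n)) / np.sqrt(m)
alpha = sketched_krr_fit(K, y, lam, S)
fitted = sketched_fitted_values(K, S, alpha)
```

Once $SK$ has been formed, only $m \times m$ systems are solved, which is where the savings
over the $n$-dimensional program come from.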

In this paper, we analyze various forms of random sketch matrices S. Let us consider a 
few of them here. 

Sub-Gaussian sketches: We say the row $s_i$ of the sketch matrix is zero-mean
1-sub-Gaussian if for any fixed unit vector $u \in S^{n-1}$, we have

$$\mathbb{P}\big[ |\langle u, s_i \rangle| \ge \delta \big] \le 2 e^{-\delta^2/2} \quad \text{for all } \delta \ge 0.$$

Many standard choices of sketch matrices have i.i.d. 1-sub-Gaussian rows in this sense;
examples include matrices with i.i.d. Gaussian entries, i.i.d. Bernoulli entries, or random
matrices with independent rows drawn uniformly from a rescaled sphere. For convenience, the
sub-Gaussian sketch matrices considered in this paper are all rescaled so that their rows have
a covariance matrix proportional to the identity $I_{n \times n}$.

Randomized orthogonal system (ROS) sketches: This class of sketches is based on
randomly sampling and rescaling the rows of a fixed orthonormal matrix
$H \in \mathbb{R}^{n \times n}$. Examples of such matrices include the discrete Fourier
transform (DFT) matrix, and the Hadamard matrix. More specifically, a ROS sketch matrix
$S \in \mathbb{R}^{m \times n}$ is formed with i.i.d. rows of the form

$$s_i = \sqrt{\frac{n}{m}} \, R H^T p_i, \quad \text{for } i = 1, \ldots, m,$$

where $R$ is a random diagonal matrix whose entries are i.i.d. Rademacher variables and
$\{p_1, \ldots, p_m\}$ is a random subset of $m$ rows sampled uniformly from the $n \times n$
identity matrix without replacement. An advantage of using ROS sketches is that for suitably
chosen orthonormal matrices, including the DFT and Hadamard cases among others, a
matrix-vector product (say of the form $Su$ for some vector $u \in \mathbb{R}^n$) can be
computed in $O(n \log m)$ time, as opposed to the $O(nm)$ time required for the same operation
with generic dense sketches. For instance, see Ailon and Liberty [1] for further details.
Throughout this paper, we focus on ROS sketches based on orthonormal matrices $H$ with
uniformly bounded entries, meaning that $|H_{ij}| \le \frac{c}{\sqrt{n}}$ for all entries
$(i, j)$. This entrywise bound is satisfied by Hadamard and DFT matrices, among others.
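
As an illustration of the ROS construction, the following Python/NumPy sketch builds a dense
Hadamard-based sketch matrix for $n$ a power of two. The function names `hadamard` and
`ros_sketch` are hypothetical, the $\sqrt{n/m}$ rescaling is an assumption of this example, and
the matrix is formed explicitly for clarity; it does not use the fast transform that yields the
$O(n \log m)$ matrix-vector product.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction); n must be a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    if H.shape[0] != n:
        raise ValueError("n must be a power of 2")
    return H / np.sqrt(n)

def ros_sketch(m, n, rng):
    """Dense m x n ROS sketch whose i-th row is sqrt(n/m) * R H^T p_i."""
    H = hadamard(n)
    signs = rng.choice([-1.0, 1.0], size=n)        # diagonal of R (Rademacher signs)
    rows = rng.choice(n, size=m, replace=False)    # indices selecting p_1, ..., p_m
    # Row i of S is row rows[i] of H, with column j multiplied by signs[j].
    return np.sqrt(n / m) * H[rows, :] * signs

rng = np.random.default_rng(0)
n, m = 64, 8
S = ros_sketch(m, n, rng)
```

With this scaling every entry of $S$ has magnitude $1/\sqrt{m}$, and the rows of $S$ are
mutually orthogonal.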


Sub-sampling sketches: This class of sketches is even simpler, based on sub-sampling the
rows of the identity matrix without replacement. In particular, the sketch matrix
$S \in \mathbb{R}^{m \times n}$ has rows of the form $s_i = p_i$, where the vectors
$\{p_1, \ldots, p_m\}$ are drawn uniformly at random without replacement from the
$n$-dimensional identity matrix. It can be understood as related to a ROS sketch, based on the
identity matrix as an orthonormal matrix, and neither using the Rademacher randomization nor
satisfying the entrywise bound. In the appendix, we show that the sketched KRR estimate (5a)
based on a sub-sampling sketch matrix is equivalent to the Nystrom approximation.


2.3 Kernel complexity measures and statistical guarantees 


So as to set the stage for later results, let us characterize an appropriate choice of the
regularization parameter $\lambda_n$, and the resulting bound on the prediction error
$\|f^\diamond - f^*\|_n$. Recall the empirical kernel matrix $K$ defined in the previous
section: since it is symmetric and positive semidefinite, it has an eigendecomposition of the
form $K = U D U^T$, where $U \in \mathbb{R}^{n \times n}$ is an orthonormal matrix, and
$D \in \mathbb{R}^{n \times n}$ is diagonal with elements
$\hat{\mu}_1 \ge \hat{\mu}_2 \ge \ldots \ge \hat{\mu}_n \ge 0$. Using these eigenvalues,
consider the kernel complexity function

$$\hat{\mathcal{R}}(\delta) = \sqrt{\frac{1}{n} \sum_{j=1}^n \min\{\delta^2, \hat{\mu}_j\}}, \qquad (6)$$

corresponding to a rescaled sum of the eigenvalues, truncated at level $\delta^2$. This
function arises via analysis of the local Rademacher complexity of the kernel class (e.g.,
[13, 17]). For a given kernel matrix and noise standard deviation $\sigma > 0$, the critical
radius is defined to be the smallest positive solution $\delta_n > 0$ to the inequality

$$\frac{\hat{\mathcal{R}}(\delta)}{\delta} \le \frac{\delta}{\sigma}. \qquad (7)$$

Note that the existence and uniqueness of this critical radius is guaranteed for any kernel
class.
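
Given the eigenvalues of $K$, the critical radius can be computed numerically. The following
Python/NumPy sketch uses bisection, relying on the observation that
$\hat{\mathcal{R}}(\delta)/\delta$ is non-increasing in $\delta$ while $\delta/\sigma$ is
increasing, so the inequality (7) holds exactly on an interval $[\delta_n, \infty)$. The
function names and the toy spectrum are assumptions for illustration.

```python
import numpy as np

def kernel_complexity(delta, eigs):
    """Empirical kernel complexity (6): sqrt((1/n) * sum_j min(delta^2, mu_j))."""
    eigs = np.asarray(eigs, dtype=float)
    return np.sqrt(np.minimum(delta ** 2, eigs).sum() / eigs.size)

def critical_radius(eigs, sigma, tol=1e-12):
    """Smallest delta > 0 with R(delta) / delta <= delta / sigma, found by bisection."""
    def satisfied(delta):
        return kernel_complexity(delta, eigs) / delta <= delta / sigma
    lo, hi = 0.0, 1.0
    while not satisfied(hi):          # grow the bracket until the inequality holds
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if satisfied(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Rank-3 toy spectrum with n = 100: R(delta) = delta * sqrt(3/100) for delta <= 1,
# so with sigma = 1 the critical radius is sqrt(0.03).
eigs = np.concatenate([np.ones(3), np.zeros(97)])
delta_n = critical_radius(eigs, sigma=1.0)
```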

Bounds on ordinary KRR: The significance of the critical radius is that it can be used
to specify bounds on the prediction error in kernel ridge regression. More precisely, suppose
that we compute the KRR estimate (3) with any regularization parameter
$\lambda_n \ge 2\delta_n^2$. Then with probability at least $1 - c_1 e^{-c_2 n \delta_n^2}$,
we are guaranteed that

$$\|f^\diamond - f^*\|_n^2 \le c_u \{\lambda_n + \delta_n^2\}, \qquad (8)$$

where $c_u > 0$ is a universal constant (independent of $n$, $\sigma$ and the kernel). This
known result follows from standard techniques in empirical process theory; we also note that
it can be obtained as a corollary of our more general theorem on sketched KRR estimates to
follow (viz. Theorem 2).

To illustrate, let us consider a few examples of reproducing kernel Hilbert spaces, and
compute the critical radius in different cases. In working through these examples, so as to
determine explicit rates, we assume that the design points $\{x_i\}_{i=1}^n$ are sampled
i.i.d. from some underlying distribution $P$, and we make use of the fact that, up to constant
factors, we can always work with the population-level kernel complexity function

$$\mathcal{R}(\delta) = \sqrt{\frac{1}{n} \sum_{j=1}^\infty \min\{\delta^2, \mu_j\}}, \qquad (9)$$

where $\{\mu_j\}_{j=1}^\infty$ are the eigenvalues of the kernel integral operator (assumed to
be uniformly bounded). This equivalence follows from standard results on the population and
empirical Rademacher complexities.


Example 1 (Polynomial kernel). For some integer $D \ge 1$, consider the kernel function on
$[0,1] \times [0,1]$ given by $\mathcal{K}_{poly}(u, v) = (1 + \langle u, v \rangle)^D$. For
$D = 1$, it generates the class of all linear functions of the form $f(x) = a_0 + a_1 x$ for
some scalars $(a_0, a_1)$, and corresponds to a linear kernel. More generally, for larger
integers $D$, it generates the class of all polynomial functions of degree at most $D$, that
is, functions of the form $f(x) = \sum_{j=0}^D a_j x^j$.

Let us now compute a bound on the critical radius $\delta_n$. It is straightforward to show
that the polynomial kernel is of finite rank at most $D + 1$, meaning that the kernel matrix
$K$ always has at most $\min\{D + 1, n\}$ non-zero eigenvalues. Consequently, as long as
$n > D + 1$, there is a universal constant $c$ such that

$$\hat{\mathcal{R}}(\delta) \le \sqrt{c \, \frac{D + 1}{n}} \, \delta,$$

which implies that $\delta_n^2 \lesssim \sigma^2 \frac{D+1}{n}$. Consequently, we conclude
that the KRR estimate satisfies the bound
$\|f^\diamond - f^*\|_n^2 \lesssim \sigma^2 \frac{D+1}{n}$ with high probability. Note that
this bound is intuitive, since a polynomial of degree $D$ has $D + 1$ free parameters.
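
The finite-rank claim in Example 1 is easy to confirm numerically. The following
Python/NumPy sketch forms the empirical kernel matrix for scalar covariates and counts its
numerically non-zero eigenvalues; the sample size, degree, random design and the relative
threshold `1e-10` are assumptions chosen for this illustration.

```python
import numpy as np

D, n = 3, 40
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=n)

# Empirical kernel matrix K_ij = (1 + x_i * x_j)^D / n for scalar covariates.
K = (1.0 + np.outer(x, x)) ** D / n

# Eigenvalues in decreasing order; count those above a small relative threshold.
eigs = np.linalg.eigvalsh(K)[::-1]
numerical_rank = int(np.sum(eigs > 1e-10 * eigs[0]))
```

Here the kernel expands as $1 + 3uv + 3u^2v^2 + u^3v^3$, a sum of $D + 1 = 4$ rank-one terms,
which is why at most four eigenvalues can be non-zero.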

Example 2 (Gaussian kernel). The Gaussian kernel with bandwidth $h > 0$ takes the form
$\mathcal{K}_{gau}(u, v) = e^{-\frac{1}{2h^2}(u - v)^2}$. When defined with respect to
Lebesgue measure on the real line, the eigenvalues of the kernel integral operator scale as
$\mu_j \asymp \exp(-\pi h^2 j^2)$ as $j \to \infty$. Based on this fact, it can be shown that
the critical radius scales as
$\delta_n^2 \asymp \frac{\sigma^2}{n} \sqrt{\log\big(\frac{n}{\sigma^2}\big)}$. Thus, even
though the Gaussian kernel is non-parametric (since it cannot be specified by a fixed number
of parameters), it is still a relatively small function class.

Example 3 (First-order Sobolev space). As a final example, consider the kernel defined on
the unit square $[0,1] \times [0,1]$ given by $\mathcal{K}_{sob}(u, v) = \min\{u, v\}$. It
generates the function class

$$H^1[0,1] = \Big\{ f : [0,1] \to \mathbb{R} \;\Big|\; f(0) = 0, \text{ and } f \text{ is abs. cts. with } \int_0^1 [f'(x)]^2 \, dx < \infty \Big\},$$

a class that contains all Lipschitz functions on the unit interval $[0,1]$. Roughly speaking,
we can think of the first-order Sobolev class as functions that are almost everywhere
differentiable with derivative in $L^2[0,1]$. Note that this is a much larger kernel class than
the Gaussian kernel class. The first-order Sobolev space can be generalized to higher order
Sobolev spaces, in which functions have additional smoothness. See the standard books on the
subject for further details on these and other reproducing kernel Hilbert spaces.

If the kernel integral operator is defined with respect to Lebesgue measure on the unit
interval, then the population level eigenvalues are given by
$\mu_j = \big(\frac{2}{(2j-1)\pi}\big)^2$ for $j = 1, 2, \ldots$. Given this relation, some
calculation shows that the critical radius scales as
$\delta_n^2 \asymp \big(\frac{\sigma^2}{n}\big)^{2/3}$. This is the familiar minimax risk for
estimating Lipschitz functions in one dimension.


Lower bounds for non-parametric regression: For future reference, it is also convenient
to provide a lower bound on the prediction error achievable by any estimator. In order to do
so, we first define the statistical dimension of the kernel as

$$d_n := \min\{ j \in \{1, \ldots, n\} : \hat{\mu}_j \le \delta_n^2 \}, \qquad (11)$$

and $d_n = n$ if no such index $j$ exists. By definition, we are guaranteed that
$\hat{\mu}_j > \delta_n^2$ for all $j \in \{1, 2, \ldots, d_n\}$. In terms of this statistical
dimension, we have

$$\hat{\mathcal{R}}(\delta_n) = \Big[ \frac{d_n}{n} \delta_n^2 + \frac{1}{n} \sum_{j=d_n+1}^n \hat{\mu}_j \Big]^{1/2},$$

showing that the statistical dimension controls a type of bias-variance tradeoff.
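
The definition (11) translates directly into a few lines of code. The following plain-Python
sketch returns the first index whose eigenvalue falls at or below $\delta_n^2$; the function
name `statistical_dimension` and the toy spectrum $\mu_j = j^{-2}$ are assumptions for
illustration.

```python
def statistical_dimension(eigs, delta_sq):
    """First (1-based) index j with eigs[j-1] <= delta_sq, or n if there is none.

    `eigs` must be sorted in decreasing order, as in the eigendecomposition of K.
    """
    for j, mu in enumerate(eigs, start=1):
        if mu <= delta_sq:
            return j
    return len(eigs)

# Hypothetical spectrum mu_j = j^{-2}, similar in decay to the Sobolev example.
eigs = [1.0 / j ** 2 for j in range(1, 101)]
d_small = statistical_dimension(eigs, delta_sq=0.011)   # 1/81 > 0.011 >= 1/100
d_none = statistical_dimension(eigs, delta_sq=1e-6)     # no eigenvalue is small enough
```

For a spectrum decaying as $j^{-2}$, the returned index grows like $1/\delta_n$, which matches
the cube-root scaling quoted for the Sobolev kernel once $\delta_n^2 \asymp n^{-2/3}$.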

It is reasonable to expect that the critical rate $\delta_n$ should be related to the
statistical dimension as $\delta_n^2 \asymp \frac{\sigma^2 d_n}{n}$. This scaling relation
holds whenever the tail sum satisfies a bound of the form
$\sum_{j > d_n} \hat{\mu}_j \lesssim d_n \delta_n^2$. Although it is possible to construct
pathological examples in which this scaling relation does not hold, it is true for most
kernels of interest, including all examples considered in this paper. For any such regular
kernel, the critical radius provides a fundamental lower bound on the performance of any
estimator, as summarized in the following theorem:


Theorem 1 (Critical radius and minimax risk). Given $n$ i.i.d. samples
$\{(y_i, x_i)\}_{i=1}^n$ from the standard non-parametric regression model over any regular
kernel class, any estimator $\tilde{f}$ has prediction error lower bounded as

$$\sup_{\|f^*\|_{\mathcal{H}} \le 1} \mathbb{E} \|\tilde{f} - f^*\|_n^2 \ge c_\ell \, \delta_n^2, \qquad (12)$$

where $c_\ell > 0$ is a numerical constant, and $\delta_n$ is the critical radius (7).

The proof of this claim, provided in Appendix B.1, is based on a standard application of
Fano’s inequality, combined with a random packing argument. It establishes that the critical
radius is a fundamental quantity, corresponding to the appropriate benchmark to which sketched
kernel regression estimates should be compared.


3 Main results and their consequences


We now turn to statements of our main theorems on kernel sketching, as well as a discussion of
some of their consequences. We first introduce the notion of a $K$-satisfiable sketch matrix,
and then show (in Theorem 2) that any sketched KRR estimate based on a $K$-satisfiable sketch
also achieves the minimax risk. We illustrate this achievable result with several corollaries
for different types of randomized sketches. For Gaussian and ROS sketches, we show that
choosing the sketch dimension proportional to the statistical dimension of the kernel (with
additional log factors in the ROS case) is sufficient to guarantee that the resulting sketch
will be $K$-satisfiable with high probability. In addition, we illustrate the sharpness of our
theoretical predictions via some experimental simulations.

3.1 General conditions for sketched kernel optimality 

Recall the definition (11) of the statistical dimension $d_n$, and consider the
eigendecomposition $K = U D U^T$ of the kernel matrix, where $U \in \mathbb{R}^{n \times n}$ is
an orthonormal matrix of eigenvectors, and
$D = \mathrm{diag}\{\hat{\mu}_1, \ldots, \hat{\mu}_n\}$ is a diagonal matrix of eigenvalues.
Let $U_1 \in \mathbb{R}^{n \times d_n}$ denote the left block of $U$, and similarly, let
$U_2 \in \mathbb{R}^{n \times (n - d_n)}$ denote the right block. Note that the columns of the
left block $U_1$ correspond to the eigenvectors of $K$ associated with the leading $d_n$
eigenvalues, whereas the columns of the right block $U_2$ correspond to the eigenvectors
associated with the remaining $n - d_n$ smallest eigenvalues. Intuitively, a sketch matrix
$S \in \mathbb{R}^{m \times n}$ is “good” if the sub-matrix
$S U_1 \in \mathbb{R}^{m \times d_n}$ is relatively close to an isometry, whereas the
sub-matrix $S U_2 \in \mathbb{R}^{m \times (n - d_n)}$ has a relatively small operator norm.

This intuition can be formalized in the following way. For a given kernel matrix $K$, a
sketch matrix $S$ is said to be $K$-satisfiable if there is a universal constant $c$ such that

$$\|(S U_1)^T S U_1 - I_{d_n}\|_{\mathrm{op}} \le 1/2, \quad \text{and} \quad \|S U_2 D_2^{1/2}\|_{\mathrm{op}} \le c \, \delta_n, \qquad (13)$$

where $D_2 = \mathrm{diag}\{\hat{\mu}_{d_n+1}, \ldots, \hat{\mu}_n\}$.
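
For a fixed kernel matrix and sketch, the two conditions in (13) are deterministic and can be
evaluated directly, at the cost of a full eigendecomposition. The following Python/NumPy code
is a sketch for small problems only; the function name `is_k_satisfiable`, the default constant
`c = 1.0`, and the toy values of $d_n$ and $\delta_n$ are assumptions made for illustration.

```python
import numpy as np

def is_k_satisfiable(K, S, delta_n, d_n, c=1.0):
    """Check the two conditions in (13); returns (ok, isometry_deviation, tail_norm)."""
    mu, U = np.linalg.eigh(K)                    # eigh returns ascending eigenvalues
    order = np.argsort(mu)[::-1]                 # reorder to decreasing, as in the text
    mu, U = np.clip(mu[order], 0.0, None), U[:, order]
    U1, U2 = U[:, :d_n], U[:, d_n:]
    SU1 = S @ U1
    # Operator norm of (S U1)^T (S U1) - I.
    dev = np.linalg.norm(SU1.T @ SU1 - np.eye(d_n), ord=2)
    # Operator norm of S U2 D2^{1/2} (taken to be 0 when the right block is empty).
    tail = S @ (U2 * np.sqrt(mu[d_n:]))
    tail_norm = np.linalg.norm(tail, ord=2) if tail.shape[1] > 0 else 0.0
    return bool(dev <= 0.5 and tail_norm <= c * delta_n), dev, tail_norm

# Toy check with the trivial sketch S = I on the kernel min{u, v}.
n = 64
x = np.arange(1, n + 1) / n
K = np.minimum.outer(x, x) / n
mu = np.sort(np.linalg.eigvalsh(K))[::-1]
d_n = 4
delta_n = 1.01 * np.sqrt(mu[d_n])    # hypothetical radius with mu_{d_n + 1} <= delta_n^2
ok_identity, dev_id, tail_id = is_k_satisfiable(K, np.eye(n), delta_n, d_n)
```

With $S$ equal to the identity, the first quantity vanishes and the second equals
$\sqrt{\hat{\mu}_{d_n+1}}$, so the check passes whenever $\hat{\mu}_{d_n+1} \le \delta_n^2$.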

Given this definition, the following theorem shows that any sketched KRR estimate based
on a $K$-satisfiable matrix achieves the minimax risk (with high probability over the noise in
the observation model):

Theorem 2 (Upper bound). Given $n$ i.i.d. samples $\{(y_i, x_i)\}_{i=1}^n$ from the standard
non-parametric regression model, consider the sketched KRR problem (5a) based on a
$K$-satisfiable sketch matrix $S$. Then for any $\lambda_n \ge 2\delta_n^2$, the sketched
regression estimate $\hat{f}$ from equation (5b) satisfies the bound

$$\|\hat{f} - f^*\|_n^2 \le c_u \{\lambda_n + \delta_n^2\}$$

with probability greater than $1 - c_1 e^{-c_2 n \delta_n^2}$.

We emphasize that in the case of fixed design regression and for a fixed sketch matrix,
the $K$-satisfiable condition on the sketch matrix $S$ is a deterministic statement: apart from
the sketch matrix, it only depends on the properties of the kernel function $\mathcal{K}$ and
design variables $\{x_i\}_{i=1}^n$. Thus, when using randomized sketches, the algorithmic
randomness can be completely decoupled from the randomness in the noisy observation model (1).


Proof intuition: The proof of Theorem 2 is given in Section 4.1. At a high level, it is
based on an upper bound on the prediction error $\|\hat{f} - f^*\|_n^2$ that involves two
sources of error: the approximation error associated with solving a zero-noise version of the
KRR problem in the projected $m$-dimensional space, and the estimation error between the
noiseless and noisy versions of the projected problem. In more detail, letting
$z^* := (f^*(x_1), \ldots, f^*(x_n))$ denote the vector of function evaluations defined by
$f^*$, consider the quadratic program

$$\alpha^\dagger := \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{2n} \|z^* - \sqrt{n} K S^T \alpha\|_2^2 + \lambda_n \|\sqrt{K} S^T \alpha\|_2^2 \Big\}, \qquad (14)$$

as well as the associated fitted function
$f^\dagger = \frac{1}{\sqrt{n}} \sum_{i=1}^n (S^T \alpha^\dagger)_i \mathcal{K}(\cdot, x_i)$.
The vector $\alpha^\dagger \in \mathbb{R}^m$ is the solution of the sketched problem in the
case of zero noise, whereas the fitted function $f^\dagger$ corresponds to the best penalized
approximation of $f^*$ within the range space of $S^T$.

Given this definition, we then have the elementary inequality

$$\frac{1}{2} \|\hat{f} - f^*\|_n^2 \le \underbrace{\|f^\dagger - f^*\|_n^2}_{\text{Approximation error}} + \underbrace{\|f^\dagger - \hat{f}\|_n^2}_{\text{Estimation error}}. \qquad (15)$$

For a fixed sketch matrix, the approximation error term is deterministic: it corresponds to
the error induced by approximating $f^*$ over the range space of $S^T$. On the other hand, the
estimation error depends both on the sketch matrix and the observation noise. In Section 4.1,
we state and prove two lemmas that control the approximation and estimation error terms
respectively.
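
For completeness, inequality (15) can be checked in two lines. The derivation below is a
sketch that is not taken from the original text; it writes $\hat{f}$ for the sketched estimate
and $f^\dagger$ for its noiseless counterpart, and it uses only the triangle inequality for the
semi-norm $\|\cdot\|_n$ together with the bound $(a + b)^2 \le 2a^2 + 2b^2$.

```latex
\begin{align*}
\|\hat{f} - f^*\|_n^2
  &\le \big( \|\hat{f} - f^\dagger\|_n + \|f^\dagger - f^*\|_n \big)^2 \\
  &\le 2 \, \|f^\dagger - \hat{f}\|_n^2 + 2 \, \|f^\dagger - f^*\|_n^2 .
\end{align*}
```

Dividing both sides by 2 gives the stated decomposition into approximation and estimation
error.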

As a corollary, Theorem 2 implies the stated upper bound (8) on the prediction error of
the original (unsketched) KRR estimate (3). Indeed, this estimator can be obtained using the
“sketch matrix” $S = I_{n \times n}$, which is easily seen to be $K$-satisfiable. In practice,
however, we are interested in $m \times n$ sketch matrices with $m \ll n$, so as to achieve
computational savings. In particular, a natural conjecture is that it should be possible to
efficiently generate $K$-satisfiable sketch matrices with the projection dimension $m$
proportional to the statistical dimension $d_n$ of the kernel. Of course, one such
$K$-satisfiable matrix is given by $S = U_1^T \in \mathbb{R}^{d_n \times n}$, but it is not
easy to generate, since it requires computing the eigendecomposition of $K$. Nonetheless,
as we now show, there are various randomized constructions that lead to $K$-satisfiable sketch
matrices with high probability.


3.2 Corollaries for randomized sketches 


When combined with additional probabilistic analysis, Theorem 2 implies that various forms
of randomized sketches achieve the minimax risk using a sketch dimension proportional to the
statistical dimension $d_n$. Here we analyze the Gaussian and ROS families of random sketches,
as previously defined in Section 2.2. Throughout our analysis, we require that the sketch
dimension satisfies a lower bound of the form

$$m \ge \begin{cases} c \, d_n & \text{for Gaussian sketches, and} \\ c \, d_n \log^4(n) & \text{for ROS sketches,} \end{cases} \qquad (16a)$$

where $d_n$ is the statistical dimension as previously defined in equation (11). Here it should
be understood that the constant $c$ can be chosen sufficiently large (but finite). In addition,
for the purposes of stating high probability results, we define the function

$$\phi(m, d_n, n) := \begin{cases} c_1 e^{-c_2 m} & \text{for Gaussian sketches, and} \\ c_1 \Big[ e^{-c_2 \frac{m}{d_n \log^2(n)}} + e^{-c_2 d_n \log^2(n)} \Big] & \text{for ROS sketches,} \end{cases} \qquad (16b)$$

where $c_1, c_2$ are universal constants. With this notation, the following result provides a
high probability guarantee for both Gaussian and ROS sketches:

Corollary 1 (Guarantees for Gaussian and ROS sketches). Given $n$ i.i.d. samples
$\{(y_i, x_i)\}_{i=1}^n$ from the standard non-parametric regression model (1), consider the
sketched KRR problem (5a) based on a sketch dimension $m$ satisfying the lower bound (16a).
Then there is a universal constant $c_u'$ such that for any $\lambda_n \ge 2\delta_n^2$, the
sketched regression estimate (5b) satisfies the bound

$$\|\hat{f} - f^*\|_n^2 \le c_u' \{\lambda_n + \delta_n^2\}$$

with probability greater than $1 - \phi(m, d_n, n) - c_3 e^{-c_4 n \delta_n^2}$.

In order to illustrate Corollary 1, let us return to the three examples previously discussed
in Section 2.3. To be concrete, we derive the consequences for Gaussian sketches, noting that
ROS sketches incur only an additional $\log^4(n)$ overhead.

• for the $D$-th order polynomial kernel from Example 1, the statistical dimension $d_n$ for
any sample size $n$ is at most $D + 1$, so that a sketch size of order $D + 1$ is sufficient.
This is a very special case, since the kernel is finite rank and so the required sketch
dimension has no dependence on the sample size.

• for the Gaussian kernel from Example 2, the statistical dimension satisfies the scaling
$d_n \asymp \sqrt{\log n}$, so that it suffices to take a sketch dimension scaling
logarithmically with the sample size.

• for the first-order Sobolev kernel from Example 3, the statistical dimension scales as
$d_n \asymp n^{1/3}$, so that a sketch dimension scaling as the cube root of the sample size
is required.
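
The three scalings listed above can be turned into a small helper for choosing the sketch
dimension of a Gaussian sketch. The following plain-Python sketch is only an illustration: the
function name `suggested_sketch_dim` is hypothetical, the unit leading constants for the
polynomial and Sobolev cases are assumptions, and the factor 1.25 for the Gaussian kernel is
the one used in the simulations reported later in this section.

```python
import math

def suggested_sketch_dim(kernel, n, degree=None):
    """Sketch dimension of the order of the statistical dimension (constants are assumptions)."""
    if kernel == "polynomial":
        return degree + 1                                  # d_n <= D + 1, independent of n
    if kernel == "gaussian":
        return math.ceil(1.25 * math.sqrt(math.log(n)))    # d_n of order sqrt(log n)
    if kernel == "sobolev":
        # round() guards against floating-point error at perfect cubes such as n = 64.
        return math.ceil(round(n ** (1.0 / 3.0), 9))       # d_n of order n^{1/3}
    raise ValueError(f"unknown kernel: {kernel}")

dims = {n: (suggested_sketch_dim("polynomial", n, degree=3),
            suggested_sketch_dim("gaussian", n),
            suggested_sketch_dim("sobolev", n))
        for n in (64, 1024)}
```

Even at $n = 1024$ the suggested dimensions stay small (4, 4 and 11 for the three kernels with
$D = 3$), which is the source of the computational savings.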


In order to illustrate these theoretical predictions, we performed some simulations.
Beginning with the Sobolev kernel $\mathcal{K}_{sob}(u, v) = \min\{u, v\}$ on the unit square,
as introduced in Example 3, we generated $n$ i.i.d. samples from the model (1) with noise
standard deviation $\sigma = 1$, the unknown regression function

$$f^*(x) = |x + 0.5| - 0.5, \qquad (17)$$

and uniformly spaced design points $x_i = \frac{i}{n}$ for $i = 1, \ldots, n$. By construction,
the function $f^*$ belongs to the first-order Sobolev space with $\|f^*\|_{\mathcal{H}} = 1$.
As suggested by our theory for the Sobolev kernel, we set the projection dimension
$m = \lceil n^{1/3} \rceil$, and then solved the sketched version of kernel ridge regression,
for both Gaussian sketches and ROS sketches based on the fast Hadamard transform. We performed
simulations for $n$ in the set $\{32, 64, 128, 256, 512, 1024\}$ so as to study scaling with
the sample size. As noted above, our theory predicts that the squared prediction loss
$\|\hat{f} - f^*\|_n^2$ should tend to zero at the same rate as that of the
unsketched estimator $f^\diamond$. Figure 1 confirms this theoretical prediction. In panel (a),
we plot the squared prediction error versus the sample size, showing that all three curves
(original, Gaussian sketch and ROS sketch) tend to zero. Panel (b) plots the rescaled
prediction error $n^{2/3} \|\hat{f} - f^*\|_n^2$ versus the sample size, with the relative
flatness of these curves confirming the $n^{-2/3}$ decay predicted by our theory.
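
A scaled-down version of the Sobolev experiment can be sketched as follows in Python with
NumPy. This is not the code used to produce the reported figures: it covers only the Gaussian
sketch, and the regularization choice $\lambda_n = 0.5\, n^{-2/3}$, the $1/\sqrt{m}$ sketch
scaling, the number of trials and the function name `sobolev_experiment` are assumptions made
for this illustration.

```python
import numpy as np

def sobolev_experiment(n, rng, sigma=1.0):
    """One trial: squared prediction error of KRR and of Gaussian-sketched KRR."""
    x = np.arange(1, n + 1) / n                      # design points x_i = i / n
    f_star = np.abs(x + 0.5) - 0.5                   # regression function (17)
    y = f_star + sigma * rng.standard_normal(n)
    K = np.minimum.outer(x, x) / n                   # Sobolev kernel with 1/n scaling
    lam = 0.5 * n ** (-2.0 / 3.0)                    # assumed choice, of order n^{-2/3}
    m = int(np.ceil(round(n ** (1.0 / 3.0), 9)))     # sketch dimension ceil(n^{1/3})

    # Unsketched KRR, from the stationarity condition of (4a).
    omega = np.linalg.solve(K + 2.0 * lam * np.eye(n), y / np.sqrt(n))
    err_krr = np.mean((np.sqrt(n) * (K @ omega) - f_star) ** 2)

    # Gaussian-sketched KRR, program (5a); the 1/sqrt(m) scaling is an assumption.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SK = S @ K
    alpha = np.linalg.solve(SK @ SK.T + 2.0 * lam * (SK @ S.T), SK @ y / np.sqrt(n))
    err_sketch = np.mean((np.sqrt(n) * (K @ (S.T @ alpha)) - f_star) ** 2)
    return err_krr, err_sketch

rng = np.random.default_rng(0)
# Average (KRR error, sketched error) over 20 trials for each sample size.
results = {n: np.mean([sobolev_experiment(n, rng) for _ in range(20)], axis=0)
           for n in (64, 256, 1024)}
```

Multiplying the averaged errors in `results` by $n^{2/3}$ gives the rescaled quantity plotted
in panel (b).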

In our second experiment, we repeated the same set of simulations, this time for the
Gaussian kernel $\mathcal{K}_{gau}(u, v) = e^{-\frac{1}{2h^2}(u - v)^2}$ with bandwidth
$h = 0.25$, and the function $f^*(x) = -1 + 2x^2$. In this case, as suggested by our theory,
we choose the sketch dimension $m = \lceil 1.25 \sqrt{\log n} \rceil$. Figure 2 shows the same
types of plots with the prediction error. In this case, we expect that the squared prediction
error will decay at the rate $\frac{\sqrt{\log n}}{n}$. This prediction is confirmed by the
plot in panel (b), showing that the rescaled error
$\frac{n}{\sqrt{\log n}} \|\hat{f} - f^*\|_n^2$, when plotted versus the sample size, remains
relatively constant over a wide range.

[Figure 1: panels (a) and (b), each titled "Pred. error for Sobolev kernel"]

Figure 1. Prediction error versus sample size for original KRR, Gaussian sketch, and ROS
sketches for the first-order Sobolev kernel for the function $f^*(x) = |x + 0.5| - 0.5$. In all
cases, each point corresponds to the average of 100 trials, with standard errors also shown.
(a) Squared prediction error $\|\hat{f} - f^*\|_n^2$ versus the sample size
$n \in \{32, 64, 128, 256, 1024\}$ for projection dimension $m = \lceil n^{1/3} \rceil$.
(b) Rescaled prediction error $n^{2/3} \|\hat{f} - f^*\|_n^2$ versus the sample size.

3.3 Comparison with Nystrom-based approaches 

It is interesting to compare the convergence rate and computational complexity of our
methods with guarantees based on the Nystrom approximation. As shown in the appendix, this
Nystrom approximation approach can be understood as a particular form of our sketched
estimate, one in which the sketch corresponds to a random row-sampling matrix.

Bach analyzed the prediction error of the Nystrom approximation to KRR based on
uniformly sampling a subset of $p$ columns of the kernel matrix $K$, leading to an overall
computational complexity of $O(np^2)$. In order for the approximation to match the performance
of KRR, the number of sampled columns must be lower bounded as

$$p \gtrsim n \, \|\mathrm{diag}(K(K + \lambda_n I)^{-1})\|_\infty \log n,$$

a quantity which can be substantially larger than the statistical dimension required by our
methods. Moreover, as shown in the following example, there are many classes of kernel
matrices for which the performance of the Nystrom approximation will be poor.

[Figure 2: two panels, (a) and (b), each titled "Pred. error for Gaussian kernel".]

Figure 2. Prediction error versus sample size for original KRR, Gaussian sketch, and ROS
sketches for the Gaussian kernel with the function $f^*(x) = -1 + 2x^2$. In all cases, each
point corresponds to the average of 100 trials, with standard errors also shown. (a) Squared
prediction error $\|\hat{f} - f^*\|_n^2$ versus the sample size $n \in \{32, 64, 128, 256, 1024\}$ for projection
dimension $m = \lceil 1.25\sqrt{\log n} \rceil$. (b) Rescaled prediction error $\frac{n}{\sqrt{\log n}}\|\hat{f} - f^*\|_n^2$ versus the sample
size.

Example 4 (Failure of Nystrom approximation). Given a sketch dimension $m \leq n \log 2$,
consider an empirical kernel matrix $K$ that has a block diagonal form $\mathrm{diag}(K_1, K_2)$, where
$K_1 \in \mathbb{R}^{(n-k) \times (n-k)}$ and $K_2 \in \mathbb{R}^{k \times k}$ for any integer $k \leq \frac{n}{m} \log 2$. Then the probability of not
sampling any of the last $k$ columns/rows is at least $1 - (1 - k/n)^m \geq 1 - e^{-km/n} \geq 1/2$. This
means that with probability at least 1/2, the sub-sampling sketch matrix can be expressed as
$S = (S_1, 0)$, where $S_1 \in \mathbb{R}^{m \times (n-k)}$. Under such an event, the sketched KRR (5a) takes on a
degenerate form, namely

$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^m} \Big\{ \frac{1}{2} \alpha^T S_1 K_1^2 S_1^T \alpha - \alpha^T S_1 \frac{K_1 y_1}{\sqrt{n}} + \lambda_n \alpha^T S_1 K_1 S_1^T \alpha \Big\},$$

an objective that depends only on the first $n - k$ observations. Since the values of the
last $k$ observations can be arbitrary, this degeneracy has the potential to lead to substantial
approximation error.
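
The degeneracy in Example 4 can be checked numerically. The sketch below is only an illustration under assumptions: the kernel blocks are random positive definite stand-ins rather than real kernel matrices, the sub-sampling sketch has rows of the form $\sqrt{n/m}\, e_j^T$ (the scaling is an assumed convention and plays no role in the conclusion), and the sketched objective is taken to be $\frac{1}{2}\alpha^T S K^2 S^T \alpha - \alpha^T S K y/\sqrt{n} + \lambda_n \alpha^T S K S^T \alpha$. All variable and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 40, 3, 6

def psd_block(size):
    """Random symmetric positive definite block (stand-in for a kernel block)."""
    B = rng.standard_normal((size, size))
    return B @ B.T / size + 0.1 * np.eye(size)

# Block-diagonal "kernel matrix" diag(K_1, K_2).
K = np.zeros((n, n))
K[: n - k, : n - k] = psd_block(n - k)
K[n - k :, n - k :] = psd_block(k)

# Sub-sampling sketch that happens to miss the last k coordinates: S = (S_1, 0).
rows = rng.choice(n - k, size=m, replace=False)
S = np.zeros((m, n))
S[np.arange(m), rows] = np.sqrt(n / m)

def sketched_alpha(y, lam=0.1):
    """Minimizer of the sketched objective for response vector y."""
    SK = S @ K
    A = SK @ SK.T + 2.0 * lam * (SK @ S.T)
    return np.linalg.solve(A, SK @ y / np.sqrt(n))

y = rng.standard_normal(n)
y_perturbed = y.copy()
y_perturbed[n - k :] += 100.0             # arbitrary change to the last k responses

alpha = sketched_alpha(y)
alpha_perturbed = sketched_alpha(y_perturbed)

# Fitted values on the last k design points are identically zero.
fitted_last_k = np.sqrt(n) * (K @ (S.T @ alpha))[n - k :]
```

Here `alpha` and `alpha_perturbed` coincide, so the estimate ignores the last k responses entirely, and the fitted values at the last k design points vanish no matter what was observed there.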

The previous example suggests that the Nystrom approximation is likely to be very sensitive
to inhomogeneity in the sampling of covariates. In order to explore this conjecture,
we performed some additional simulations, this time comparing both Gaussian and ROS
sketches with the uniform Nystrom approximation sketch. Returning again to the Gaussian
kernel $\mathcal{K}_{\mathrm{gau}}(u, v) = e^{-\frac{1}{2h^2}(u - v)^2}$ with bandwidth $h = 0.25$, and the function $f^*(x) = -1 + 2x^2$,
we first generated $n$ i.i.d. samples that were uniform on the unit interval $[0, 1]$. We then
implemented sketches of various types (Gaussian, ROS or Nystrom) using a sketch dimension
$m = \lceil 4\sqrt{\log n} \rceil$. As shown in the top row (panels (a) and (b)) of Figure 3, all three sketch
types perform very well for this regular design, with prediction error that is essentially
indistinguishable from the original KRR estimate. Keeping the same kernel and function, we then
considered an irregular form of design, namely with $k = \lceil \sqrt{n} \rceil$ samples perturbed as follows:

$$x_i \sim \begin{cases} \mathrm{Unif}[0, 1/2] & \text{if } i = 1, \ldots, n - k \\ 1 + z_i & \text{for } i = n - k + 1, \ldots, n \end{cases}$$


where each $z_i \sim N(0, 1/n)$. The performance of the sketched estimators in this case is shown
in the bottom row (panels (c) and (d)) of Figure 3. As before, both the Gaussian and ROS
sketches track the performance of the original KRR estimate very closely; in contrast, the
Nystrom approximation behaves very poorly for this regression problem, consistent with the
intuition suggested by the preceding example.

Figure 3. Prediction error versus sample size for original KRR, Gaussian sketch, ROS sketch
and Nystrom approximation. Left panels (a) and (c) show $\|\hat{f} - f^*\|_n^2$ versus the sample
size $n \in \{32, 64, 128, 256, 1024\}$ for projection dimension $m = \lceil 4\sqrt{\log n} \rceil$. In all cases, each
point corresponds to the average of 100 trials, with standard errors also shown. Right panels
(b) and (d) show the rescaled prediction error $\frac{n}{\sqrt{\log n}}\|\hat{f} - f^*\|_n^2$ versus the sample size. Top
row corresponds to covariates arranged uniformly on the unit interval, whereas bottom row
corresponds to an irregular design (see text for details).
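
To make the irregular design concrete, the following sketch generates covariates of the kind described in the text and computes how likely a uniform column sample is to miss the perturbed cluster altogether. It is an illustration only: the function names are hypothetical, the perturbed points are taken to be the last $k = \lceil \sqrt{n} \rceil$ indices, and uniform Nystrom sampling is modelled as $m$ independent draws with replacement (an assumption).

```python
import numpy as np

def irregular_design(n, rng):
    """First n - k covariates uniform on [0, 1/2]; last k = ceil(sqrt(n)) placed near 1."""
    k = int(np.ceil(np.sqrt(n)))
    x_regular = rng.uniform(0.0, 0.5, size=n - k)
    z = rng.normal(0.0, np.sqrt(1.0 / n), size=k)   # z_i ~ N(0, 1/n): standard deviation sqrt(1/n)
    return np.concatenate([x_regular, 1.0 + z]), k

def prob_miss_all(n, k, m):
    """Probability that m independent uniform draws all miss the k perturbed points."""
    return (1.0 - k / n) ** m

rng = np.random.default_rng(0)
n = 1024
x, k = irregular_design(n, rng)
m = int(np.ceil(4 * np.sqrt(np.log(n))))   # sketch dimension used for Figure 3
p_miss = prob_miss_all(n, k, m)
```

For n = 1024 this gives k = 32 and m = 11, and `p_miss` is roughly 0.7, so under this sampling model a uniform sub-sample usually contains no information at all about the cluster of points near 1.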

As is known from general theory on the Nystrom approximation, its performance can
be improved by knowledge of the so-called leverage scores of the underlying matrix. In
this vein, recent work by Alaoui and Mahoney suggests a Nystrom approximation based on
non-uniform sampling of the columns of the kernel matrix involving the leverage scores. Assuming
that the leverage scores are known, they show that their method matches the performance
of original KRR using a non-uniform sub-sample of the order $\mathrm{trace}(K(K + \lambda_n I)^{-1}) \log n$
columns. When the regularization parameter $\lambda_n$ is set optimally (that is, proportional to
$\delta_n^2$), then apart from the extra logarithmic factor, this sketch size scales with the statistical
dimension, as defined here. However, the leverage scores are not known, and their method for
obtaining a sufficiently accurate approximation requires sampling $p$ columns of the kernel matrix $K$,
where

$$p \gtrsim \frac{\mathrm{trace}(K)}{\lambda_n} \log n.$$

For a typical (normalized) kernel matrix $K$, we have $\mathrm{trace}(K) \gtrsim 1$; moreover, in order to
achieve the minimax rate, the regularization parameter $\lambda_n$ should scale with $\delta_n^2$. Putting
together the pieces, we see that the sampling parameter $p$ must satisfy the lower bound
$p \gtrsim \delta_n^{-2} \log n$. This requirement is much larger than the statistical dimension, and prohibitive
in many cases:


• for the Gaussian kernel, we have $\delta_n^2 \asymp \frac{\sqrt{\log n}}{n}$ and so $p \gtrsim n \log^{1/2}(n)$, meaning that all
rows of the kernel matrix are sampled. In contrast, the statistical dimension scales as
$\sqrt{\log n}$.

• for the first-order Sobolev kernel, we have $\delta_n^2 \asymp n^{-2/3}$ so that $p \gtrsim n^{2/3} \log n$. In contrast,
the statistical dimension for this kernel scales as $n^{1/3}$.


It remains an open question as to whether a more efficient procedure for approximating the 
leverage scores might be devised, which would allow a method of this type to be statistically 
optimal in terms of the sampling dimension. 
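
The gap between the sampling requirement $p \gtrsim \delta_n^{-2} \log n$ and the statistical dimension can be made tangible with a few numbers. The sketch below sets all unspecified constants to one (an assumption made purely for illustration) and uses the scalings quoted in the two bullet points; the function names are hypothetical.

```python
import math

def gaussian_kernel_sizes(n):
    """delta_n^2 ~ sqrt(log n)/n  gives  p ~ n sqrt(log n)  and  d_n ~ sqrt(log n)."""
    delta_sq = math.sqrt(math.log(n)) / n
    return math.log(n) / delta_sq, math.sqrt(math.log(n))

def sobolev_kernel_sizes(n):
    """delta_n^2 ~ n^(-2/3)  gives  p ~ n^(2/3) log n  and  d_n ~ n^(1/3)."""
    delta_sq = n ** (-2.0 / 3.0)
    return math.log(n) / delta_sq, n ** (1.0 / 3.0)

for n in (256, 1024, 16384):
    p_g, d_g = gaussian_kernel_sizes(n)
    p_s, d_s = sobolev_kernel_sizes(n)
    print(f"n={n:6d}  Gaussian: p~{p_g:9.0f}, d_n~{d_g:5.2f}   "
          f"Sobolev: p~{p_s:8.0f}, d_n~{d_s:6.2f}")
```

With constants set to one, the Gaussian-kernel requirement already exceeds n itself, while the statistical dimension stays in the single digits for these sample sizes.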


4 Proofs 

In this section, we provide the proofs of our main theorems. Some technical proofs of the 
intermediate results are provided in the appendices. 


4.1 Proof of Theorem 2

Recall the definition (14) of the estimate $f^\dagger$, as well as the upper bound (15) in terms of
approximation and estimation error terms. The remainder of our proof consists of two technical
lemmas used to control these two terms.

Lemma 1 (Control of estimation error). Under the conditions of Theorem 2, we have

$$\|f^\dagger - \hat{f}\|_n^2 \leq c\,\delta_n^2 \qquad (18)$$

with probability at least $1 - c_1 e^{-c_2 n \delta_n^2}$.

Lemma 2 (Control of approximation error). For any K-satisfiable sketch matrix $S$, we have

$$\|f^\dagger - f^*\|_n^2 \leq c\{\lambda_n + \delta_n^2\} \quad \text{and} \quad \|f^\dagger\|_{\mathcal{H}} \leq c\Big\{1 + \frac{\delta_n^2}{\lambda_n}\Big\}. \qquad (19)$$

These two lemmas, in conjunction with the upper bound (15), yield the claim in the theorem
statement. Accordingly, it remains to prove the two lemmas.

4.1.1 Proof of Lemma 1

So as to simplify notation, we assume throughout the proof that $\sigma = 1$. (A simple rescaling
argument can be used to recover the general statement.) Since $\alpha^\dagger$ is optimal for the quadratic
program (14), it must satisfy the zero gradient condition

$$-SK\Big(\frac{1}{\sqrt{n}} f^*(x_1^n) - K S^T \alpha^\dagger\Big) + 2\lambda_n S K S^T \alpha^\dagger = 0. \qquad (20)$$


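By the optimality of $\hat{\alpha}$ and feasibility of $\alpha^\dagger$ for the sketched problem (5a), we have

$$\frac{1}{2}\|K S^T \hat{\alpha}\|_2^2 - \frac{1}{\sqrt{n}} y^T K S^T \hat{\alpha} + \lambda_n \|\sqrt{K} S^T \hat{\alpha}\|_2^2 \leq \frac{1}{2}\|K S^T \alpha^\dagger\|_2^2 - \frac{1}{\sqrt{n}} y^T K S^T \alpha^\dagger + \lambda_n \|\sqrt{K} S^T \alpha^\dagger\|_2^2.$$

Defining the error vector $\hat{\Delta} := S^T(\hat{\alpha} - \alpha^\dagger)$, some algebra leads to the following inequality

$$\frac{1}{2}\|K\hat{\Delta}\|_2^2 \leq -\langle K\hat{\Delta}, K S^T \alpha^\dagger \rangle + \frac{1}{\sqrt{n}} y^T K \hat{\Delta} + \lambda_n \|\sqrt{K} S^T \alpha^\dagger\|_2^2 - \lambda_n \|\sqrt{K} S^T \hat{\alpha}\|_2^2. \qquad (21)$$

Consequently, by plugging in $y = z^* + w$ and applying the optimality condition (20), we obtain
the basic inequality

$$\frac{1}{2}\|K\hat{\Delta}\|_2^2 \leq \Big|\frac{1}{\sqrt{n}} w^T K \hat{\Delta}\Big| - \lambda_n \|\sqrt{K}\hat{\Delta}\|_2^2. \qquad (22)$$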

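The following lemma provides control on the right-hand side:

Lemma 3. With probability at least $1 - c_1 e^{-c_2 n \delta_n^2}$, we have

$$\Big|\frac{1}{\sqrt{n}} w^T K \Delta\Big| \leq \begin{cases} 6\delta_n \|K\Delta\|_2 + 2\delta_n^2 & \text{for all } \|\sqrt{K}\Delta\|_2 \leq 1, \\ 2\delta_n \|K\Delta\|_2 + 2\delta_n^2 \|\sqrt{K}\Delta\|_2 + \frac{1}{16}\|K\Delta\|_2^2 & \text{for all } \|\sqrt{K}\Delta\|_2 \geq 1. \end{cases} \qquad (23)$$

See Appendix B.2 for the proof of this lemma.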


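Based on this auxiliary result, we divide the remainder of our analysis into two cases:

Case 1: If $\|\sqrt{K}\hat{\Delta}\|_2 \leq 1$, then the basic inequality (22) and the top inequality in Lemma 3
imply

$$\frac{1}{2}\|K\hat{\Delta}\|_2^2 \leq \Big|\frac{1}{\sqrt{n}} w^T K \hat{\Delta}\Big| \leq 6\delta_n \|K\hat{\Delta}\|_2 + 2\delta_n^2 \qquad (24)$$

with probability at least $1 - c_1 e^{-c_2 n \delta_n^2}$. Note that we have used the fact that the randomness
in the sketch matrix $S$ is independent of the randomness in the noise vector $w$. The quadratic
inequality (24) implies that $\|K\hat{\Delta}\|_2 \leq c\,\delta_n$ for some universal constant $c$.

Case 2: If $\|\sqrt{K}\hat{\Delta}\|_2 > 1$, then the basic inequality (22) and the bottom inequality in
Lemma 3 imply

$$\frac{1}{2}\|K\hat{\Delta}\|_2^2 \leq 2\delta_n \|K\hat{\Delta}\|_2 + 2\delta_n^2 \|\sqrt{K}\hat{\Delta}\|_2 + \frac{1}{16}\|K\hat{\Delta}\|_2^2 - \lambda_n \|\sqrt{K}\hat{\Delta}\|_2^2$$

with probability at least $1 - c_1 e^{-c_2 n \delta_n^2}$. If $\lambda_n \geq 2\delta_n^2$, then under the assumed condition
$\|\sqrt{K}\hat{\Delta}\|_2 > 1$, the above inequality gives

$$\frac{1}{2}\|K\hat{\Delta}\|_2^2 \leq 2\delta_n \|K\hat{\Delta}\|_2 + \frac{1}{16}\|K\hat{\Delta}\|_2^2 \leq \frac{1}{4}\|K\hat{\Delta}\|_2^2 + 4\delta_n^2 + \frac{1}{16}\|K\hat{\Delta}\|_2^2.$$

By rearranging terms in the above, we obtain $\|K\hat{\Delta}\|_2^2 \leq c\,\delta_n^2$ for a universal constant $c$, which
completes the proof.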
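
The two "rearranging" steps in this proof are elementary, and it may help to see them worked out with explicit constants. The derivation below is a sketch under assumptions: it writes $x = \|K\hat{\Delta}\|_2$ and $\delta = \delta_n$, takes the Case 1 inequality in the form $\frac{1}{2}x^2 \leq 6\delta x + 2\delta^2$, and in Case 2 uses the Young-type split $2\delta x \leq \frac{1}{4}x^2 + 4\delta^2$. The resulting numerical constants are illustrative and are not claimed to be the ones intended by the authors.

```latex
% Case 1: solve the quadratic inequality in x.
\tfrac{1}{2}x^2 \le 6\delta x + 2\delta^2
\;\Longrightarrow\; x^2 - 12\delta x - 4\delta^2 \le 0
\;\Longrightarrow\; x \le \frac{12 + \sqrt{144 + 16}}{2}\,\delta
   = \bigl(6 + \sqrt{40}\bigr)\delta < 13\,\delta .

% Case 2: collect the x^2 terms on the left-hand side.
\tfrac{1}{2}x^2 \le \tfrac{1}{4}x^2 + 4\delta^2 + \tfrac{1}{16}x^2
\;\Longrightarrow\; \tfrac{3}{16}\,x^2 \le 4\delta^2
\;\Longrightarrow\; x^2 \le \tfrac{64}{3}\,\delta^2 .
```

In both cases $\|K\hat{\Delta}\|_2^2$ is at most a constant multiple of $\delta_n^2$, which is all that Lemma 1 requires.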

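4.1.2 Proof of Lemma 2

Our goal is to show the bound

$$\frac{1}{2n}\|z^* - \sqrt{n} K S^T \alpha^\dagger\|_2^2 + \lambda_n \|\sqrt{K} S^T \alpha^\dagger\|_2^2 \leq c\{\lambda_n + \delta_n^2\}.$$

In fact, since $\alpha^\dagger$ is a minimizer, it suffices to exhibit some $\alpha \in \mathbb{R}^m$ for which this inequality
holds. Recalling the eigendecomposition $K = UDU^T$, it is equivalent to exhibit some $\alpha \in \mathbb{R}^m$
such that

$$\frac{1}{2}\|\theta^* - D\tilde{S}^T \alpha\|_2^2 + \lambda_n \alpha^T \tilde{S} D \tilde{S}^T \alpha \leq c\{\lambda_n + \delta_n^2\}, \qquad (25)$$

where $\tilde{S} = SU$ is the transformed sketch matrix, and the vector $\theta^* = n^{-1/2} U^T z^* \in \mathbb{R}^n$ satisfies
the ellipse constraint $\|D^{-1/2}\theta^*\|_2 \leq 1$.

We do so via a constructive procedure. First, we partition the vector $\theta^* \in \mathbb{R}^n$ into two
sub-vectors, namely $\theta_1^* \in \mathbb{R}^{d_n}$ and $\theta_2^* \in \mathbb{R}^{n - d_n}$. Similarly, we partition the diagonal matrix $D$
into two blocks, $D_1$ and $D_2$, with dimensions $d_n$ and $n - d_n$ respectively. Under the condition
$m > d_n$, we may let $\tilde{S}_1 \in \mathbb{R}^{m \times d_n}$ denote the left block of the transformed sketch matrix, and
similarly, let $\tilde{S}_2 \in \mathbb{R}^{m \times (n - d_n)}$ denote the right block. In terms of this notation, the assumption
that $S$ is K-satisfiable corresponds to the inequalities

$$|||\tilde{S}_1^T \tilde{S}_1 - I_{d_n}|||_{op} \leq \frac{1}{2}, \quad \text{and} \quad |||\tilde{S}_2 \sqrt{D_2}|||_{op} \leq c\,\delta_n. \qquad (26)$$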

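As a consequence, we are guaranteed that the matrix $\tilde{S}_1^T \tilde{S}_1$ is invertible, so that we may define
the m-dimensional vector

$$\hat{\alpha} = \tilde{S}_1 (\tilde{S}_1^T \tilde{S}_1)^{-1} (D_1)^{-1} \theta_1^* \in \mathbb{R}^m.$$

Recalling the disjoint partition of our vectors and matrices, we have

$$\|\theta^* - D\tilde{S}^T \hat{\alpha}\|_2^2 = \underbrace{\|\theta_1^* - D_1 \tilde{S}_1^T \hat{\alpha}\|_2^2}_{=0} + \underbrace{\|\theta_2^* - D_2 \tilde{S}_2^T \tilde{S}_1 (\tilde{S}_1^T \tilde{S}_1)^{-1} D_1^{-1} \theta_1^*\|_2^2}_{T_1^2}. \qquad (27a)$$

By the triangle inequality, we have

$$T_1 \leq \|\theta_2^*\|_2 + \|D_2 \tilde{S}_2^T \tilde{S}_1 (\tilde{S}_1^T \tilde{S}_1)^{-1} D_1^{-1} \theta_1^*\|_2$$
$$\leq \|\theta_2^*\|_2 + |||D_2 \tilde{S}_2^T|||_{op} \, |||\tilde{S}_1|||_{op} \, |||(\tilde{S}_1^T \tilde{S}_1)^{-1}|||_{op} \, |||D_1^{-1/2}|||_{op} \, \|D_1^{-1/2}\theta_1^*\|_2$$
$$\leq \|\theta_2^*\|_2 + |||\sqrt{D_2}|||_{op} \, |||\tilde{S}_2 \sqrt{D_2}|||_{op} \, |||\tilde{S}_1|||_{op} \, |||(\tilde{S}_1^T \tilde{S}_1)^{-1}|||_{op} \, |||D_1^{-1/2}|||_{op} \, \|D_1^{-1/2}\theta_1^*\|_2.$$

Since $\|D^{-1/2}\theta^*\|_2 \leq 1$, we have $\|D_1^{-1/2}\theta_1^*\|_2 \leq 1$ and moreover

$$\|\theta_2^*\|_2^2 = \sum_{j=d_n+1}^{n} (\theta_j^*)^2 \leq \delta_n^2 \sum_{j=d_n+1}^{n} \frac{(\theta_j^*)^2}{\hat{\mu}_j} \leq \delta_n^2,$$

since $\hat{\mu}_j \leq \delta_n^2$ for all $j \geq d_n + 1$. Similarly, we have $|||\sqrt{D_2}|||_{op} \leq \sqrt{\hat{\mu}_{d_n+1}} \leq \delta_n$, and $|||D_1^{-1/2}|||_{op} \leq \delta_n^{-1}$.
Putting together the pieces, we have

$$T_1 \leq \delta_n + |||\tilde{S}_2 \sqrt{D_2}|||_{op} \, |||\tilde{S}_1|||_{op} \, |||(\tilde{S}_1^T \tilde{S}_1)^{-1}|||_{op} \leq \delta_n + (c\,\delta_n)\sqrt{3/2} \cdot 2 = c'\delta_n, \qquad (27b)$$

where we have invoked the K-satisfiability of the sketch matrix to guarantee the bounds
$|||\tilde{S}_1|||_{op} \leq \sqrt{3/2}$, $|||(\tilde{S}_1^T \tilde{S}_1)^{-1}|||_{op} \leq 2$ and $|||\tilde{S}_2 \sqrt{D_2}|||_{op} \leq c\,\delta_n$. Bounds (27a) and (27b) in
conjunction guarantee that

$$\|\theta^* - D\tilde{S}^T \hat{\alpha}\|_2^2 \leq c\,\delta_n^2, \qquad (28a)$$

where the value of the universal constant $c$ may change from line to line.

Turning to the remaining term on the left-side of inequality (25), applying the triangle
inequality and the previously stated bounds leads to

$$\hat{\alpha}^T \tilde{S} D \tilde{S}^T \hat{\alpha} \leq \|D_1^{-1/2}\theta_1^*\|_2^2 + \Big( |||\sqrt{D_2}\tilde{S}_2^T|||_{op} \, |||\tilde{S}_1|||_{op} \, |||(\tilde{S}_1^T \tilde{S}_1)^{-1}|||_{op} \, |||D_1^{-1/2}|||_{op} \, \|D_1^{-1/2}\theta_1^*\|_2 \Big)^2$$
$$\leq 1 + \Big( (c\,\delta_n)\sqrt{3/2} \cdot 2 \cdot \frac{1}{\delta_n} \cdot 1 \Big)^2 \leq c'. \qquad (28b)$$

Combining the two bounds (28a) and (28b) yields the claim (25).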


5 Discussion 

In this paper, we have analyzed randomized sketching methods for kernel ridge regression. 
Our main theorem gives sufficient conditions on any sketch matrix for the sketched estimate 
to achieve the minimax risk for non-parametric regression over the underlying kernel class. 
We specialized this general result to two broad classes of sketches, namely those based on 
Gaussian random matrices and randomized orthogonal systems (ROS), for which we proved 
that a sketch size proportional to the statistical dimension is sufficient to achieve the minimax 
risk. More broadly, we suspect that sketching methods of the type analyzed here have the 
potential to save time and space in other forms of statistical computation, and we hope that 
the results given here are useful for such explorations. 


Acknowledgements 


All authors were partially supported by Office of Naval Research MURI grant N00014-11-1- 
0688, National Science Foundation Grants CIF-31712-23800 and DMS-1107000, and Air Force 
Office of Scientific Research grant AFOSR-FA9550-14-1-0016. In addition, MP was supported 
by a Microsoft Research Fellowship. 


A Subsampling sketches yield Nystrom approximation 

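In this appendix, we show that the sub-sampling sketch matrix described at the end of
Section 2.2 coincides with applying Nystrom approximation to the kernel matrix.

We begin by observing that the original KRR quadratic program (4a) can be written in
the equivalent form $\min_{\omega \in \mathbb{R}^n, u \in \mathbb{R}^n} \{ \frac{1}{2n}\|u\|^2 + \lambda_n \omega^T K \omega \}$ such that $y - \sqrt{n} K \omega = u$. The dual of
this constrained quadratic program (QP) is given by

$$\xi^\dagger = \arg\max_{\xi \in \mathbb{R}^n} \Big\{ -\frac{n}{4\lambda_n} \xi^T K \xi + \xi^T y - \frac{n}{2} \xi^T \xi \Big\}. \qquad (29)$$

The KRR estimate $f^\diamond$ and the original solution $\omega^\dagger$ can be recovered from the dual solution $\xi^\dagger$
via the relations $f^\diamond(\cdot) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \omega_i^\dagger \mathcal{K}(\cdot, x_i)$ and $\omega^\dagger = \frac{\sqrt{n}}{2\lambda_n} \xi^\dagger$.

Now turning to the sketched KRR program (5a), note that it can be written in the
equivalent form $\min_{\alpha \in \mathbb{R}^m, u \in \mathbb{R}^n} \{ \frac{1}{2n}\|u\|^2 + \lambda_n \alpha^T S K S^T \alpha \}$ subject to the constraint $y - \sqrt{n} K S^T \alpha = u$.
The dual of this constrained QP is given by

$$\xi^\ddagger = \arg\max_{\xi \in \mathbb{R}^n} \Big\{ -\frac{n}{4\lambda_n} \xi^T \tilde{K} \xi + \xi^T y - \frac{n}{2} \xi^T \xi \Big\}, \qquad (30)$$

where $\tilde{K} = K S^T (S K S^T)^{-1} S K$ is a rank-$m$ matrix in $\mathbb{R}^{n \times n}$. In addition, the sketched KRR
estimate $\hat{f}$, the original solution $\hat{\alpha}$ and the dual solution $\xi^\ddagger$ are related by $\hat{f}(\cdot) = \frac{1}{\sqrt{n}} \sum_{i=1}^n (S^T \hat{\alpha})_i \mathcal{K}(\cdot, x_i)$
and $\hat{\alpha} = \frac{\sqrt{n}}{2\lambda_n} (S K S^T)^{-1} S K \xi^\ddagger$.

When $S$ is the sub-sampling sketch matrix, the matrix $\tilde{K} = K S^T (S K S^T)^{-1} S K$ is known
as the Nystrom approximation. Consequently, the dual formulation of sketched KRR
based on a sub-sampling matrix can be viewed as the Nystrom approximation as applied to
the dual formulation of the original KRR problem.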
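
The identification of the sub-sampling sketch with the Nystrom approximation is easy to verify numerically. The sketch below is illustrative and rests on assumptions: it uses the first-order Sobolev kernel $1 + \min\{u, v\}$ as a convenient positive definite example, samples columns without replacement so that the sampled block is invertible, and scales the rows of the sub-sampling matrix by $\sqrt{n/m}$ (a convention that cancels in the final expression). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 8
x = np.sort(rng.uniform(0.0, 1.0, size=n))
K = (1.0 + np.minimum.outer(x, x)) / n        # normalized first-order Sobolev kernel matrix

idx = rng.choice(n, size=m, replace=False)    # sampled columns (distinct, for invertibility)
S = np.zeros((m, n))
S[np.arange(m), idx] = np.sqrt(n / m)         # sub-sampling sketch; the scaling cancels below

# Sketch form: K S^T (S K S^T)^{-1} S K.
SK = S @ K
K_sketch = SK.T @ np.linalg.solve(SK @ S.T, SK)

# Classical Nystrom form: K[:, idx] K[idx, idx]^{-1} K[idx, :].
K_nystrom = K[:, idx] @ np.linalg.solve(K[np.ix_(idx, idx)], K[idx, :])
```

The two matrices agree to numerical precision, both have rank m, and both reproduce the sampled columns of K exactly, which is the defining property of the Nystrom approximation.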


B Technical Proofs 

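B.1 Proof of Theorem 1

We begin by converting the problem to an instance of the normal sequence model [11]. Recall
that the kernel matrix can be decomposed as $K = U^T D U$, where $U \in \mathbb{R}^{n \times n}$ is orthonormal,
and $D = \mathrm{diag}\{\hat{\mu}_1, \ldots, \hat{\mu}_n\}$. Any function $f^* \in \mathcal{H}$ can be decomposed as

$$f^* = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \mathcal{K}(\cdot, x_j)(U^T \beta^*)_j + g, \qquad (31)$$

for some vector $\beta^* \in \mathbb{R}^n$, and some function $g \in \mathcal{H}$ that is orthogonal to $\mathrm{span}\{\mathcal{K}(\cdot, x_j), j = 1, \ldots, n\}$.
Consequently, the inequality $\|f^*\|_{\mathcal{H}} \leq 1$ implies that

$$\Big\| \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \mathcal{K}(\cdot, x_j)(U^T \beta^*)_j \Big\|_{\mathcal{H}}^2 = (U^T \beta^*)^T U^T D U (U^T \beta^*) = \|\sqrt{D}\beta^*\|_2^2 \leq 1.$$

Moreover, we have $f^*(x_1^n) = \sqrt{n} U^T D \beta^*$, and so the original observation model (1) has the
equivalent form $y = \sqrt{n} U^T \theta^* + w$, where $\theta^* = D\beta^*$. In fact, due to the rotation invariance of
the Gaussian, it is equivalent to consider the normal sequence model

$$y = \theta^* + \frac{w}{\sqrt{n}}. \qquad (32)$$

Any estimate $\hat{\theta}$ of $\theta^*$ defines the function estimate $\hat{f}(\cdot) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathcal{K}(\cdot, x_i)(U^T D^{-1} \hat{\theta})_i$, and by
construction, we have $\|\hat{f} - f^*\|_n^2 = \|\hat{\theta} - \theta^*\|_2^2$. Finally, the original constraint $\|\sqrt{D}\beta^*\|_2 \leq 1$ is
equivalent to $\|D^{-1/2}\theta^*\|_2 \leq 1$. Thus, we have a version of the normal sequence model subject
to an ellipse constraint.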


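After this reduction, we can assume that we are given n i.i.d. observations $y_1^n = \{y_1, \ldots, y_n\}$,
and our goal is to lower bound the Euclidean error $\|\hat{\theta} - \theta^*\|_2^2$ of any estimate of $\theta^*$. In order
to do so, we first construct a $\delta/2$-packing of the set $\mathcal{B} = \{\theta \in \mathbb{R}^n \mid \|D^{-1/2}\theta\|_2 \leq 1\}$, say
$\{\theta^1, \ldots, \theta^M\}$. Now consider the random ensemble of regression problems in which we first
draw an index $A$ uniformly at random from the index set $[M]$, and then conditioned on $A = a$,
we observe n i.i.d. samples from the non-parametric regression model with $f^* = f^a$. Given
this set-up, a standard argument using Fano's inequality implies that

$$\mathbb{P}\Big[\|\hat{f} - f^*\|_n^2 \geq \frac{\delta^2}{16}\Big] \geq 1 - \frac{I(y_1^n; A) + \log 2}{\log M},$$

where $I(y_1^n; A)$ is the mutual information between the samples $y_1^n$ and the random index $A$.
It remains to construct the desired packing and to upper bound the mutual information.

For a given $\delta > 0$, define the ellipse

$$\mathcal{E}(\delta) := \Big\{ \theta \in \mathbb{R}^n \;\Big|\; \sum_{j=1}^{n} \frac{\theta_j^2}{\min\{\delta^2, \hat{\mu}_j\}} \leq 1 \Big\}. \qquad (33)$$

By construction, observe that $\mathcal{E}(\delta)$ is contained within the Hilbert ball of unit radius.
Consequently, it suffices to construct a $\delta/2$-packing of this ellipse in the Euclidean norm.

Lemma 4. For any $\delta \in (0, \delta_n]$, there is a $\delta/2$-packing of the ellipse $\mathcal{E}(\delta)$ with cardinality

$$\log M = \frac{1}{64} d_n. \qquad (34)$$

Taking this packing as given, note that by construction, we have

$$\|\theta^a\|_2^2 = \delta^2 \sum_{j=1}^{n} \frac{(\theta_j^a)^2}{\delta^2} \leq \delta^2, \quad \text{and hence} \quad \|\theta^a - \theta^b\|_2^2 \leq 4\delta^2.$$

In conjunction with convexity of the KL divergence, we have

$$I(y_1^n; A) \leq \frac{1}{M^2} \sum_{a,b=1}^{M} D(\mathbb{P}^a \,\|\, \mathbb{P}^b) = \frac{1}{M^2} \sum_{a,b=1}^{M} \frac{n}{2\sigma^2} \|\theta^a - \theta^b\|_2^2 \leq \frac{2n}{\sigma^2}\delta^2.$$

For any $\delta$ such that $\log 2 \leq \frac{2n}{\sigma^2}\delta^2$ and $\delta \leq \delta_n$, we have

$$\mathbb{P}\Big[\|\hat{f} - f^*\|_n^2 \geq \frac{\delta^2}{16}\Big] \geq 1 - \frac{4n\delta^2}{\sigma^2 \log M} = 1 - \frac{256\, n\delta^2}{\sigma^2 d_n}.$$

Moreover, since the kernel is regular, we have $\sigma^2 d_n \geq c\, n\delta_n^2$ for some positive constant $c$. Thus,
setting $\delta^2 = \frac{c\,\delta_n^2}{512}$ yields the claim.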

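Proof of Lemma 4. It remains to prove the lemma, and we do so via the probabilistic
method. Consider a random vector $\theta \in \mathbb{R}^n$ of the form

$$\theta = \Big[ \frac{\delta}{\sqrt{2d_n}} w_1 \;\; \frac{\delta}{\sqrt{2d_n}} w_2 \;\; \cdots \;\; \frac{\delta}{\sqrt{2d_n}} w_{d_n} \;\; 0 \;\; \cdots \;\; 0 \Big], \qquad (35)$$

where $w = (w_1, \ldots, w_{d_n})^T \sim N(0, I_{d_n})$ is a standard Gaussian vector. We claim that a
collection of M such random vectors $\{\theta^1, \ldots, \theta^M\}$, generated in an i.i.d. manner, defines the
required packing with high probability.

On one hand, for each index $a \in [M]$, since $\delta^2 \leq \delta_n^2 \leq \hat{\mu}_j$ for each $j \leq d_n$, we have
$\|\theta^a\|_{\mathcal{E}}^2 = \frac{\|w^a\|_2^2}{2d_n}$, corresponding to a normalized $\chi^2$-variate. Consequently, by a combination of
standard tail bounds and the union bound, we have

$$\mathbb{P}\big[\|\theta^a\|_{\mathcal{E}}^2 \leq 1 \text{ for all } a \in [M]\big] \geq 1 - M e^{-\frac{d_n}{16}}.$$

Now consider the difference vector $\theta^a - \theta^b$. Since the underlying Gaussian noise vectors
$w^a$ and $w^b$ are independent, the difference vector $w^a - w^b$ follows a $N(0, 2I_{d_n})$ distribution.
Consequently, the event $\|\theta^a - \theta^b\|_2 \geq \frac{\delta}{2}$ has the same probability as the event $\sqrt{2}\|\theta\|_2 \geq \frac{\delta}{2}$, where $\theta$
is a random vector drawn from the original ensemble. Note that $\|\theta\|_2^2 = \delta^2 \frac{\|w\|_2^2}{2d_n}$. Then
a combination of standard tail bounds for $\chi^2$-distributions and the union bound argument
yields

$$\mathbb{P}\big[\|\theta^a - \theta^b\|_2 \geq \tfrac{\delta}{2} \text{ for all } a \neq b \in [M]\big] \geq 1 - M^2 e^{-\frac{d_n}{16}}.$$

Combining the last two displays together, we obtain

$$\mathbb{P}\big[\|\theta^a\|_{\mathcal{E}}^2 \leq 1 \text{ and } \|\theta^a - \theta^b\|_2 \geq \tfrac{\delta}{2} \text{ for all } a \neq b \in [M]\big] \geq 1 - M e^{-\frac{d_n}{16}} - M^2 e^{-\frac{d_n}{16}}.$$

This probability is positive for $\log M = d_n/64$.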


B.2 Proof of Lemma 3

For use in the proof, for each $\delta > 0$, let us define the random variable

$$Z_n(\delta) = \sup_{\substack{\|\sqrt{K}\Delta\|_2 \leq 1 \\ \|K\Delta\|_2 \leq \delta}} \Big| \frac{1}{\sqrt{n}} w^T K \Delta \Big|. \qquad (36)$$

Top inequality in the bound (23): If the top inequality is violated, then we claim that we
must have $Z_n(\delta_n) > 2\delta_n^2$. On one hand, if the bound (23) is violated by some vector $\Delta \in \mathbb{R}^n$
with $\|K\Delta\|_2 \leq \delta_n$, then we have

$$2\delta_n^2 \leq \Big| \frac{1}{\sqrt{n}} w^T K \Delta \Big| \leq Z_n(\delta_n).$$

On the other hand, if the bound is violated by some vector with $\|K\Delta\|_2 > \delta_n$, then we can
define the rescaled vector $\tilde{\Delta} = \frac{\delta_n}{\|K\Delta\|_2}\Delta$, for which we have

$$\|K\tilde{\Delta}\|_2 = \delta_n, \quad \text{and} \quad \|\sqrt{K}\tilde{\Delta}\|_2 = \frac{\delta_n}{\|K\Delta\|_2}\|\sqrt{K}\Delta\|_2 \leq 1,$$

showing that $Z_n(\delta_n) \geq 2\delta_n^2$ as well.

When viewed as a function of the standard Gaussian vector $w \in \mathbb{R}^n$, it is easy to see that
$Z_n(\delta_n)$ is Lipschitz with parameter $\delta_n/\sqrt{n}$. Consequently, by concentration of measure for
Lipschitz functions of Gaussians [15], we have

$$\mathbb{P}\big[Z_n(\delta_n) \geq \mathbb{E}[Z_n(\delta_n)] + t\big] \leq e^{-\frac{n t^2}{2\delta_n^2}}. \qquad (37)$$


Moreover, we claim that 


E|Z„(i„)| < 




1 

-E 

n “ 

1=1 


mm 


(ii) 


7e(<5„) 


(38) 


where inequality (ii) follows by definition of the critical radius (recalling that we have set
$\sigma = 1$ by a rescaling argument). Setting $t = \delta_n^2$ in the tail bound (37), we see that $\mathbb{P}[Z_n(\delta_n) \geq
2\delta_n^2] \leq e^{-n\delta_n^2/2}$, which completes the proof of the top bound.

It only remains to prove inequality (i) in equation (38). The kernel matrix $K$ can be
decomposed as $K = U^T D U$, where $D = \mathrm{diag}\{\hat{\mu}_1, \ldots, \hat{\mu}_n\}$, and $U$ is a unitary matrix. Defining
the vector $\beta = DU\Delta$, the two constraints on $\Delta$ can be expressed as $\|D^{-1/2}\beta\|_2 \leq 1$ and
$\|\beta\|_2 \leq \delta$. Note that any vector satisfying these two constraints must belong to the ellipse


n o2 


o2 

= |/3 G M"' I — < 2 where uj = maxjd^, 


j=l 3 


Consequently, we have 


E[E„(5n)] <E snv^\{U^w, /3)| 

L/3e£Vn' 


= E 


sup^|(u;, /3)| 
p&e y/n' 


since U'^w also follows a standard normal distribution. By the Cauchy-Schwarz inequality, 
we have 


E 


sup^|(u;, /3)| 


< 



Ujw- 



7 ?(< 5 n) 


where the final step follows from Jensen’s inequality. 


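Bottom inequality in the bound (23): We now turn to the proof of the bottom inequality.
We claim that it suffices to show that

$$\Big| \frac{1}{\sqrt{n}} w^T K \tilde{\Delta} \Big| \leq 2\delta_n \|K\tilde{\Delta}\|_2 + 2\delta_n^2 + \frac{1}{16}\|K\tilde{\Delta}\|_2^2 \qquad (39)$$

for all $\tilde{\Delta} \in \mathbb{R}^n$ such that $\|\sqrt{K}\tilde{\Delta}\|_2 = 1$. Indeed, for any vector $\Delta \in \mathbb{R}^n$ with $\|\sqrt{K}\Delta\|_2 > 1$, we
can define the rescaled vector $\tilde{\Delta} = \Delta/\|\sqrt{K}\Delta\|_2$, for which we have $\|\sqrt{K}\tilde{\Delta}\|_2 = 1$. Applying
the bound (39) to this choice and then multiplying both sides by $\|\sqrt{K}\Delta\|_2$, we obtain

$$\Big| \frac{1}{\sqrt{n}} w^T K \Delta \Big| \leq 2\delta_n \|K\Delta\|_2 + 2\delta_n^2 \|\sqrt{K}\Delta\|_2 + \frac{1}{16}\frac{\|K\Delta\|_2^2}{\|\sqrt{K}\Delta\|_2}$$
$$\leq 2\delta_n \|K\Delta\|_2 + 2\delta_n^2 \|\sqrt{K}\Delta\|_2 + \frac{1}{16}\|K\Delta\|_2^2,$$

as required.

Recall the family of random variables $Z_n$ previously defined (36). For any $u \geq \delta_n$, we have

$$\mathbb{E}[Z_n(u)] \leq \hat{\mathcal{R}}(u) = u\,\frac{\hat{\mathcal{R}}(u)}{u} \overset{(i)}{\leq} u\,\frac{\hat{\mathcal{R}}(\delta_n)}{\delta_n} \overset{(ii)}{\leq} u\,\delta_n,$$

where inequality (i) follows since the function $u \mapsto \hat{\mathcal{R}}(u)/u$ is non-increasing, and step (ii) follows
by our choice of $\delta_n$. Setting $t = \frac{u^2}{64}$ in the concentration bound (37), we conclude that

$$\mathbb{P}\Big[Z_n(u) \geq u\,\delta_n + \frac{u^2}{64}\Big] \leq e^{-c n u^2} \quad \text{for each } u \geq \delta_n. \qquad (40)$$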

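We are now equipped to prove the bound (39) via a "peeling" argument. Let $\mathcal{E}$ denote the
event that the bound (39) is violated for some vector $\tilde{\Delta}$ with $\|\sqrt{K}\tilde{\Delta}\|_2 = 1$. For real numbers
$0 \leq a < b$, let $\mathcal{E}(a, b)$ denote the event that it is violated for some vector with $\|\sqrt{K}\tilde{\Delta}\|_2 = 1$
and $\|K\tilde{\Delta}\|_2 \in [a, b]$. For $m = 0, 1, 2, \ldots$, define $u_m = 2^m \delta_n$. We then have the decomposition
$\mathcal{E} = \mathcal{E}(0, u_0) \cup \big( \bigcup_{m=0}^{\infty} \mathcal{E}(u_m, u_{m+1}) \big)$ and hence by union bound,

$$\mathbb{P}[\mathcal{E}] \leq \mathbb{P}[\mathcal{E}(0, u_0)] + \sum_{m=0}^{\infty} \mathbb{P}[\mathcal{E}(u_m, u_{m+1})]. \qquad (41)$$

The final step is to bound each of the terms in this summation. Since $u_0 = \delta_n$, we have

$$\mathbb{P}[\mathcal{E}(0, u_0)] \leq \mathbb{P}[Z_n(\delta_n) \geq 2\delta_n^2] \leq e^{-c n \delta_n^2}. \qquad (42)$$

On the other hand, suppose that $\mathcal{E}(u_m, u_{m+1})$ holds, meaning that there exists some vector
$\tilde{\Delta}$ with $\|\sqrt{K}\tilde{\Delta}\|_2 = 1$ and $\|K\tilde{\Delta}\|_2 \in [u_m, u_{m+1}]$ such that

$$\Big| \frac{1}{\sqrt{n}} w^T K \tilde{\Delta} \Big| \geq 2\delta_n \|K\tilde{\Delta}\|_2 + 2\delta_n^2 + \frac{1}{16}\|K\tilde{\Delta}\|_2^2$$
$$\geq 2\delta_n u_m + 2\delta_n^2 + \frac{1}{16}u_m^2$$
$$\geq \delta_n u_{m+1} + \frac{1}{64}u_{m+1}^2,$$

where the second inequality follows since $\|K\tilde{\Delta}\|_2 \geq u_m$, and the third inequality follows
since $u_{m+1} = 2u_m$. This lower bound implies that $Z_n(u_{m+1}) \geq \delta_n u_{m+1} + \frac{u_{m+1}^2}{64}$, whence the
bound (40) implies that

$$\mathbb{P}[\mathcal{E}(u_m, u_{m+1})] \leq e^{-c n u_{m+1}^2} \leq e^{-c n 2^{2m} \delta_n^2}.$$

Combining this tail bound with our earlier bound (42) and substituting into the union
bound (41) yields

$$\mathbb{P}[\mathcal{E}] \leq e^{-c n \delta_n^2} + \sum_{m=0}^{\infty} \exp\big( -c n 2^{2m} \delta_n^2 \big) \leq c_1 e^{-c_2 n \delta_n^2},$$

as claimed.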


B.3 Proof of Corollary 1

Based on Theorem 2, we need to verify that the stated lower bound (16a) on the projection
dimension is sufficient to guarantee that a random sketch matrix is K-satisfiable with high
probability. In particular, let us state this guarantee as a formal claim:

Lemma 5. Under the lower bound (16a) on the sketch dimension, a {Gaussian, ROS} random
sketch is K-satisfiable with probability at least $\phi(m, d_n, n)$.

We split our proof into two parts, one for each inequality in the definition (13) of
K-satisfiability.


B.3.1 Proof of inequality (i):

We need to bound the operator norm of the matrix $Q = U_1^T S^T S U_1 - I_{d_n}$, where the matrix
$U_1 \in \mathbb{R}^{n \times d_n}$ has orthonormal columns. Let $\{v^1, \ldots, v^N\}$ be a 1/2-cover of the Euclidean
sphere $\mathcal{S}^{d_n - 1}$; by standard arguments, we can find such a set with $N \leq e^{2d_n}$ elements.
Using this cover, a straightforward discretization argument yields

$$|||Q|||_{op} \leq 4 \max_{j=1,\ldots,N} \langle v^j, Q v^j \rangle = 4 \max_{j=1,\ldots,N} (\tilde{v}^j)^T \{ S^T S - I_n \} \tilde{v}^j,$$

where $\tilde{v}^j := U_1 v^j \in \mathcal{S}^{n-1}$, and $\tilde{Q} = S^T S - I_n$. In the Gaussian case, standard sub-exponential
bounds imply that $\mathbb{P}[(\tilde{v}^j)^T \tilde{Q} \tilde{v}^j \geq 1/8] \leq c_1 e^{-c_2 m}$, and consequently, by the union bound, we
have


P 


> 1/2] < cie 


—C2m+4dn 


< cie 


where the second and third steps uses the assumed lower bound on m. In the ROS case, 
results of Krahmer and Ward imply that 

>l/2]<cie 


where the final step uses the assumed lower bound on m. 


B.3.2 Proof of inequality (ii):

We split this claim into two sub-parts: one for Gaussian sketches, and the other for ROS
sketches. Throughout the proof, we make use of the $n \times n$ diagonal matrix $\bar{D} = \mathrm{diag}(0_{d_n}, D_2)$,
with which we have $S U_2 D_2^{1/2} = S U \bar{D}^{1/2}$.

Gaussian case: By the definition of the matrix spectral norm, we know


\lSUD^/\^:= sup {u,Sv), (43) 

v&£ 


where £1 = {u G M” | ||[/L>t!||2 < 1}, and = {u £ M™' | ||u||2 = 1}. 

We may choose a 1/2-cover $\{u^1, \ldots, u^M\}$ of the set $\mathcal{S}^{m-1}$ with $\log M \leq 2m$
elements. We then have


\lSUD^/\, < max sup(m-^ , Sv) + - sup {u, Sv) 
je[M] 2 ^^gdn-1 

v&£ 

= max sup(«^ Sv) + 
je[M] v&£ 4 


and re-arranging implies that 


\lSUD^/%, < 2 max sup(ri-^ , Sv) . 
je[M]v&£ 

' -v;-' 

z 


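For each fixed $u^j \in \mathcal{S}^{m-1}$, consider the random variable $Z^j := \sup_{v \in \mathcal{E}} \langle u^j, Sv \rangle$. It is equal
in distribution to the random variable $V(g) = \frac{1}{\sqrt{m}} \sup_{v \in \mathcal{E}} \langle g, v \rangle$, where $g \in \mathbb{R}^n$ is a standard
Gaussian vector. For $g, g' \in \mathbb{R}^n$, we have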


1^(5) -^(5')! < ^ sup 1(5-5, u)| 


^ v&£ 


< 


2\\D. 


1/2,, 


2 ( 5 „ 


m 


-\\g-g'h < ^h-g' 


m 


where we have used the fact that $\hat{\mu}_j \leq \delta_n^2$ for all $j \geq d_n + 1$. Consequently, by concentration
of measure for Lipschitz functions of Gaussian random variables [15], we have


F[V{g) > E[-F(5)] +t] < e 


(44) 


Turning to the expectation, we have 


E[^(5)] 



< 21 




m 



E n 

j = dr, + l gj 


n 


< 2(i„ 


(45) 


where the last inequality follows since m > n6^ and y — ^ < 6^. Combining the pieces, 
we have shown have shown that ¥[Z^ > co(l -|- e)(5„] < for each j = 1,..., M. Finally, 

setting t = c6n in the tail bound (1441) for a constant c > 1 large enough to ensure that 
> 21ogM. Taking the union bound over all j G [M] yields 

which completes the proof.

ROS case: Here we pursue a matrix Chernoff argument analogous to that in the paper [24].
Letting $r \in \{-1, 1\}^n$ denote an i.i.d. sequence of Rademacher variables, the ROS sketch can
be written in the form $S = PH\,\mathrm{diag}(r)$, where $P$ is a partial identity matrix scaled by $n/m$,
and the matrix $H$ is orthonormal with elements bounded as $|H_{ij}| \leq c/\sqrt{n}$ for some constant
$c$. With this notation, we can write


-l m 

|||PRdiag(r)P^/2|||2^ = ll|—X] 

i=l 

where $v_i \in \mathbb{R}^n$ are random vectors of the form $\sqrt{n}\,\bar{D}^{1/2}\mathrm{diag}(r)He$, where $e \in \mathbb{R}^n$ is chosen
uniformly at random from the standard Euclidean basis.
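
For readers who want to experiment with the ROS construction, here is a minimal sketch of how such a matrix might be built. It rests on assumptions not fixed by the text at this point: $H$ is taken to be the orthonormal Walsh-Hadamard matrix (so $n$ must be a power of two), rows are sampled uniformly with replacement, and the overall scaling is $\sqrt{n/m}$, chosen so that $\mathbb{E}[S^T S] = I_n$. The function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix via Sylvester's recursion (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def ros_sketch(m, n, rng):
    """S = sqrt(n/m) * P H diag(r), with P selecting m rows uniformly at random."""
    r = rng.choice([-1.0, 1.0], size=n)     # i.i.d. Rademacher signs
    rows = rng.integers(0, n, size=m)       # row indices selected by P (with replacement)
    return np.sqrt(n / m) * hadamard(n)[rows, :] * r   # broadcasting applies diag(r) on the right

rng = np.random.default_rng(0)
n, m = 64, 8
S = ros_sketch(m, n, rng)
```

With this choice every entry of `S` has magnitude $1/\sqrt{m}$, which reflects the boundedness condition $|H_{ij}| \leq c/\sqrt{n}$ used in the argument. In practice one would apply a fast Walsh-Hadamard transform rather than form `H` explicitly; the dense version is used here only for clarity.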

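We first show that the vectors $\{v_i\}_{i=1}^m$ are uniformly bounded with high probability. Note
that we certainly have $\max_{i \in [m]} \|v_i\|_2 \leq \max_{j \in [n]} F_j(r)$, where

$$F_j(r) := \sqrt{n}\,\|\bar{D}^{1/2}\mathrm{diag}(r)He_j\|_2 = \sqrt{n}\,\|\bar{D}^{1/2}\mathrm{diag}(He_j)r\|_2.$$

Beginning with the expectation, define the vector $\tilde{r} = \mathrm{diag}(He_j)r$, and note that it has entries
bounded in absolute value by $c/\sqrt{n}$. Thus we have

$$\mathbb{E}[F_j(r)] \leq \big( n\,\mathbb{E}[\tilde{r}^T \bar{D} \tilde{r}] \big)^{1/2} \leq c \sqrt{\sum_{j=d_n+1}^{n} \hat{\mu}_j} \leq c\sqrt{n}\,\delta_n^2.$$

For any two vectors $r, r' \in \mathbb{R}^n$, we have

$$|F_j(r) - F_j(r')| \leq \sqrt{n}\,\|r - r'\|_2\,|||\bar{D}^{1/2}\mathrm{diag}(He_j)|||_{op} \leq c\,\delta_n \|r - r'\|_2.$$

Consequently, by concentration results for convex Lipschitz functions of Rademacher
variables [15], we have

$$\mathbb{P}\big[F_j(r) \geq c_0 \sqrt{n}\,\delta_n^2 \log n\big] \leq c_1 e^{-c_2 n \delta_n^2 \log^2 n}.$$

Taking the union bound over all n rows, we see that

$$\max_{i \in [m]} \|v_i\|_2 \leq \max_{j \in [n]} F_j(r) \leq 4\sqrt{n}\,\delta_n^2 \log(n)$$

with probability at least $1 - c_1 e^{-c_2 n \delta_n^2 \log^2(n)}$. Finally, a simple calculation shows that
$|||\mathbb{E}[v_1 v_1^T]|||_{op} \leq \delta_n^2$. Consequently, by standard matrix Chernoff bounds [24], we have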


^ m 

\\—^ViV. 

m 

2 = 1 


III op P 




(46) 


from which the claim follows.